Problems and procedures in assessing and obtaining fit of
data to the Rasch model are treated in the paper. The
assumptions embodied in the model are made explicit and
it is concluded that statistical tests are needed which
are sensitive to deviations such that more than one item
parameter would be needed for each item, and such that more
than one person parameter would be needed for each person.
Statistical goodness-of-fit tests, based on the conditional
maximum likelihood estimates of the item parameters, which
can detect these two kinds of deviation are presented.
Common sources of deviation are also identified, as are the
tests needed to detect them. Problems in the use of statistical
tests to test fit are discussed and some investigations
of power are presented. Relative to a distinction
between use of the model as a criterion and as an
instrument, the nature of the goodness-of-fit problem in
different measurement techniques is discussed. Finally, it is
concluded that items which can be identified as misfitting
should not be routinely excluded to obtain fit to the model;
instead other actions should often be taken, such as grouping
of the items into homogeneous subsets.



Introduction 



Theorists and practitioners are to an increasing extent focussing
attention on what is called the latent trait (LT) models within
test theory (Baker, 1977; Hambleton & Cook, 1977). The LT models
specify a relationship between observable examinee performance
and an unobservable trait, assumed to underlie performance. Their
great power stems from the fact that parameters describing
characteristics of the test items can be estimated in such a way
that they are invariant from one group of persons to another,
and estimates of the ability of persons can be made in such a
way that they are invariant from one sample of items to another.

The family of LT models has many members (Lord & Novick, 1968;
Hambleton, Swaminathan, Cook, Eignor & Gifford, 1977) but the
most important ones seem to be certain models for dichotomous
items, based on logistic functions. The simplest of these is
the Rasch model (Rasch, 1960, 1966), or the 1-parameter logistic
model. In the Rasch model one parameter only is used to describe
each item, but there are also other models such as the 2- and
3-parameter models (Birnbaum, 1968), in which additional
parameters are used to describe characteristics of items.

The Rasch model has important theoretical and practical
advantages when it comes to the estimation of parameters (Andersen,
1973a; Fischer, 1974; Gustafsson, 19 ). The relative simplicity
of the Rasch model also makes it easy to apply the model in
solving practical measurement problems, such as linking and
equating tests or parts of tests, carrying out tailored testing,
constructing item banks and so on (cf. Wright, 1977a). These
reasons are sufficient to explain why the Rasch model is the LT
model most frequently applied.

The LT models have very desirable characteristics which make
possible the solution of measurement problems which are difficult
or impossible to solve within the framework of classical
test theory. But the models entail strong assumptions about
the nature of the data, and unless these assumptions are
fulfilled, the validity of the results of applications is
endangered. The Rasch model is the most constrained one, and
it is also the model which entails the strongest assumptions.
The question of whether the data fit the models or not is
therefore of great importance.

More specifically, there are three reasons why the question 
of fit is an important one. In the first place, it is important 
to realize that if the assumptions are fulfilled for a set of 
data, then all the desirable characteristics of the LT models 
are logical implications of the mathematical structure of the 
models themselves; the validity of applications need therefore 
not be empirically proven if the data fit a model. Secondly, 
in some cases fit to a model is an important end in itself, 
because the models, and above all the Rasch model, formalize 
desirable characteristics of measurements (cf. Gustafsson, 
1977; Wright, 1977b). Thirdly, in those cases where, for some 
reason, it is necessary to use an LT model without the data
fitting it, it is essential that the deviations from the model 
are reasonably well-known, since different applications are 
endangered to different degrees depending on the type of 
deviation. 

This paper deals with the problems of assessing and obtaining
fit of data to the Rasch model. The paper concentrates on this
model because of the advantages it has over other models, and
because it entails the strongest assumptions.

Ever since the model was first formulated by Rasch (1960) the
problem of fit has been studied, and statistical tests of
goodness-of-fit have been developed (Wright & Panchapakesan,
1969; Andersen, 1973b; Martin-Löf, 1973; Fischer, 1974; Mead,
1976a, 1976b). But there are factors which motivate another
treatise on the subject.

The development of computational algorithms (Gustafsson, 1977,
1979) has made another class of statistical tests of
goodness-of-fit available for general use. These are based on the
conditional maximum likelihood approach to estimation of item
parameters in the Rasch model (Andersen, 1973b; Martin-Löf,
1973), and they have better statistical properties than most
other goodness-of-fit tests. Even more important, however, is
the fact that there are such tests, only described in an
unpublished report by Martin-Löf (1973), which are sensitive
to deviations from the model that are difficult to detect
with other methods. These statistical tests are presented in
the paper, along with a presentation of the conditional approach
to estimating the item parameters.

The sensitivity of different statistical tests to different
sources of deviation from the Rasch model has not been much
studied, and an attempt is made to shed light upon this problem.
A few studies of the power of the goodness-of-fit tests as a
function of factors such as sample size and number of items
are also presented.

Closely associated with the problem of fit is the question of
the robustness of the model. Analyses of that problem are
presented in relation to one particular kind of application:
the equating of tests.

Strategies used to obtain fit of data to the model are also 
discussed and problems inherent in the most commonly used 
strategy are identified. On the basis of that criticism an 
alternative strategy is outlined. 



1. The Rasch model and its assumptions 



According to the Rasch model, the probability of a correct
answer to an item is a function of two parameters only, one
describing the difficulty of the item ($\sigma_i$, i=1,...,k) and
one describing the ability of the person taking the test
($\xi_v$, v=1,...,n). If we denote a correct answer to item i by
person v as $A_{vi} = 1$, the probability of this outcome is:

(1.1)  $P(A_{vi} = 1 \mid \xi_v, \sigma_i) = \dfrac{\exp(\xi_v - \sigma_i)}{1 + \exp(\xi_v - \sigma_i)}$
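
As a minimal sketch of (1.1), the following Python function evaluates the Rasch probability of a correct answer. The function name `rasch_prob` and the numerical values are illustrative assumptions and do not come from the paper.

```python
import math

def rasch_prob(xi, sigma):
    """Probability of a correct answer under (1.1) for ability xi and difficulty sigma."""
    # exp(x) / (1 + exp(x)) written in the equivalent form 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + math.exp(-(xi - sigma)))

# When ability equals difficulty the probability is one half.
print(rasch_prob(0.0, 0.0))
# Only the difference xi - sigma matters, so these two calls agree.
print(rasch_prob(1.5, 0.5), rasch_prob(1.0, 0.0))
```

Because only the difference between ability and difficulty enters the formula, shifting both parameters by the same amount leaves the probability unchanged.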



The item characteristic curve (ICC) is a central concept
in LT theory. The ICC is the function relating the probability
of a correct answer to an item (i) to the ability
variable ($\xi$). From (1.1) follows that the ICC for an item in
the Rasch model is:

(1.2)  $P(A_i = 1 \mid \xi) = \dfrac{\exp(\xi - \sigma_i)}{1 + \exp(\xi - \sigma_i)}$

The ICC is a function of one parameter only, the item parameter,
or, a term which will be used interchangeably, the difficulty.

In the Rasch model unidimensionality is assumed since there
is only one parameter of ability. However, we will need a more
exact definition of unidimensionality. Lord and Novick (1968)
presented two definitions of unidimensionality in LT models.



One of the definitions (p. 359) is actually a definition of
dimensionality of any order but here it has, along with some
change in the notation, been rewritten as a definition of
unidimensionality:

Consider a set of k items and one latent trait $\xi$ which
affects examinee performance on all items in the set. We
can now represent each examinee as a point on the
trait. Next, consider all the examinee populations that may
be of interest for this set of k items. Assume that
each item is administered just once to each examinee,
and consider the conditional frequency distribution
(over people) of item score for any fixed value of $\xi$.
If this (unobservable) distribution is not the same for
all the populations of examinees, then there must be one
or more psychological dimensions in addition to $\xi$, that
discriminate among the populations of interest. In
defining the complete latent space, therefore, we must
include these additional dimensions. Thus, by definition,
in the complete latent space the conditional distribution
of item score for fixed $\xi$ is the same for all
populations of interest.



From this definition of unidimensionality follows that the
ICC for an item is invariant for those populations used to
define the complete latent space.

The other definition of unidimensionality given by Lord &
Novick (1968) is founded on the concept of local (or conditional)
statistical independence. If we use the algebraic
notation $A_{vi} = a_{vi}$ to represent the score, 0 or 1, of person v
to item i, the assumption of local statistical independence
can be written:

(1.3)  $P(A_{v1} = a_{v1}, A_{v2} = a_{v2}, \ldots, A_{vk} = a_{vk} \mid \xi_v) = \prod_{i=1}^{k} P(A_{vi} = a_{vi} \mid \xi_v)$

Thus, if local independence holds the probability of an examinee
response pattern is given by the product of the probabilities of
the item responses, and the Lord and Novick (1968, p. 540)
definition says that if (1.3) holds for some real-valued trait $\xi$ the
measurements satisfy a unidimensional latent-trait model.
This abstract definition is given a more concrete meaning when
formulated as follows:

"An individual's performance depends on a single underlying
trait if, given his value on that trait, nothing
further can be learned from him that can contribute to the
explanation of his performance. The proposition is that
the latent trait is the only important factor and, once a
person's value on the trait is determined, the behavior is
random, in the sense of statistical independence" (Lord &
Novick, 1968, p. 538)



The difference between the two definitions of unidimensionality
is that the latter one explicitly introduces the assumption of
local statistical independence. However, there is no conflict
between the two definitions since the assumption of local
statistical independence is equivalent to the assumption that
the latent variable under consideration spans the complete latent
space (Lord & Novick, 1968, p. 361).
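
The product rule expressed by local independence in (1.3) can be illustrated with a small sketch. The helper names `rasch_prob` and `pattern_prob`, the three item difficulties and the ability value are all hypothetical choices made for the example.

```python
import math

def rasch_prob(xi, sigma):
    """Rasch probability of a correct answer, as in (1.1)."""
    return 1.0 / (1.0 + math.exp(-(xi - sigma)))

def pattern_prob(pattern, xi, sigmas):
    """Probability of a 0/1 response pattern under local independence, as in (1.3)."""
    prob = 1.0
    for a, sigma in zip(pattern, sigmas):
        p = rasch_prob(xi, sigma)
        prob *= p if a == 1 else 1.0 - p
    return prob

sigmas = [-1.0, 0.0, 1.0]   # hypothetical item difficulties
xi = 0.3                    # hypothetical ability
patterns = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
total = sum(pattern_prob(p, xi, sigmas) for p in patterns)
print(total)  # the eight pattern probabilities add up to 1, up to rounding
```

Given the trait value, each item contributes its own factor to the pattern probability, which is the sense in which nothing beyond the trait is needed to describe the responses.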

It is also necessary to consider another attempt to define uni- 
dimensionality. Lumsden (1978) formulated a statistical model 
in which both items and persons are located on an attribute 
continuum (latent variable). In contrast to the Lord & Novick
approach the items, and not the persons, have a point location
on the continuum, while the persons are assumed to have a
distribution of attribute locations, resulting from moment to
moment fluctuations. The distribution of attribute locations
is not assumed to be the same for all persons, taking into
account the possibility that persons may differ in reliability
(Lumsden, 1977).

In the Lumsden formulation, unidimensionality of the items is
assured by the fact that they are located on the same attribute
continuum. However, the Lumsden model is not unidimensional in
the sense of Lord and Novick's definition, which is best seen
if the Lumsden formulation is formalized. One way to do this
is to generalize the Rasch model, taking into account the
possibility of varying person reliabilities, adding another
parameter for each person ($\phi_v$, v=1,...,n), which, in
accordance with Mead (1976b), will be referred to as the sensitivity
parameter:

(1.4)  $P(A_{vi} = 1 \mid \xi_v, \phi_v, \sigma_i) = \dfrac{\exp(\phi_v(\xi_v - \sigma_i))}{1 + \exp(\phi_v(\xi_v - \sigma_i))}$

This model will be referred to as the Lumsden model. In this
model the PCC's (person characteristic curves), in which for
each person the probability of a correct answer is shown as
a function of item difficulty, are not parallel, with the
sensitivity parameter reflecting the slope of the PCC.
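
A short sketch may help to show what non-parallel PCC's look like. It assumes the generalized model just described, with a person sensitivity parameter written here as `phi`; the function name `lumsden_prob` and all numerical values are hypothetical.

```python
import math

def lumsden_prob(xi, phi, sigma):
    """P(correct) for ability xi, sensitivity phi and item difficulty sigma."""
    return 1.0 / (1.0 + math.exp(-phi * (xi - sigma)))

difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
# Two hypothetical persons with the same ability but different sensitivity.
pcc_low = [lumsden_prob(0.0, 0.5, s) for s in difficulties]
pcc_high = [lumsden_prob(0.0, 2.0, s) for s in difficulties]
for s, lo, hi in zip(difficulties, pcc_low, pcc_high):
    print(f"sigma={s:+.1f}  phi=0.5: {lo:.3f}  phi=2.0: {hi:.3f}")
```

Both curves pass through 0.5 where item difficulty equals the common ability, but the person with the higher sensitivity has the steeper curve, so the two PCC's cross instead of running parallel. With `phi` equal to 1 for everyone the function reduces to the Rasch model.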

In the Lumsden model knowledge of a person's ability parameter
$\xi_v$, which can be interpreted as the mean of his distribution
of attribute locations, could not alone explain his performance,
since the sensitivity parameter would also be needed. Therefore,
the Lumsden model is not unidimensional in the sense of Lord
and Novick's definition of the term.

It is in all likelihood impossible to obtain separate estimates
of the $\xi_v$ and the $\phi_v$ parameters, so the Lumsden model is not
useful as an LT-model. It is useful, however, as an alternative
model to the Rasch model in investigations of fit since it does
specify a certain kind of multidimensionality.

Another assumption in the Rasch model is that all the items have
the same discriminative power, i.e. that all the ICC's are
parallel. The meaning of this assumption is most clearly seen
if the Rasch model is contrasted with the Birnbaum (1968) model,
or the 2-parameter model, which introduces another parameter
for each item, the discrimination parameter ($\alpha_i$, i=1,...,k).
According to the Birnbaum model, the ICC for an item is:

(1.5)  $P(A_i = 1 \mid \xi) = \dfrac{\exp(\alpha_i(\xi - \sigma_i))}{1 + \exp(\alpha_i(\xi - \sigma_i))}$



The discrimination parameter reflects the relation between
performance on the item and the latent variable, and if the
discrimination parameter is different among the items in a
set, the ICC's will be non-parallel.
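
The effect of unequal discrimination can be sketched numerically. The function names `birnbaum_icc` and `logit` and the parameter values below are assumptions made for the illustration; the ICC is the 2-parameter logistic form described above.

```python
import math

def birnbaum_icc(xi, alpha, sigma):
    """ICC of the 2-parameter model with discrimination alpha and difficulty sigma."""
    return 1.0 / (1.0 + math.exp(-alpha * (xi - sigma)))

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

# Two hypothetical items with equal difficulty but different discrimination.
gaps = []
for xi in (-1.0, 0.0, 1.0):
    flat = birnbaum_icc(xi, 0.5, 0.0)
    steep = birnbaum_icc(xi, 2.0, 0.0)
    # For parallel ICC's the log-odds gap would be constant; here it changes with xi.
    gaps.append(logit(steep) - logit(flat))
print([round(g, 3) for g in gaps])
```

On the log-odds scale Rasch ICC's differ only by a constant shift, which is the sense in which they are parallel; with unequal discrimination parameters the gap depends on the ability level and the curves cross.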

The discrimination parameters are different if the items in
addition to the common latent trait reflect different "specific"
factors and/or if they are differently affected by random errors.
However, it is difficult to define a specific factor as opposed
to another latent variable, of which Lord and Novick (1968) were
aware:

The psychometrician is likely to wish to define his comp- 
lete latent space to include all "important" psychological 
dimensions that affect performance on a given set of items 
and to exclude those variables that comprise "errors of
measurement". Unfortunately, it seems logically impossible
to distinguish objectively those variables that are simply
"errors of measurement" from those that are not (p. 340).

Thus, even at the level of formal definition it is difficult to 
make a distinction between the assumption of unidimensionality 
and the assumption of homogeneous item discrimination. 

The specific factors are assumed to be uncorrelated with the 
ability measured by all the items and with the specific components 
of all the other items. However, even though the specific factors 
can fulfill the assumptions of orthogonality when we confine our 
attention to a specific sample of items, they are not likely to 
do so in the "population of items" (cf. Lumsden, 1978). Since 
generalization is practically always intended beyond a certain 
set of items, the distinction between the assumption of uni- 
dimensionality and the assumption of homogeneous item discrimina- 
tion becomes even more difficult to uphold. 

Three assumptions in the Rasch model have been discussed: the
assumption of unidimensionality, the assumption of local
statistical independence, and the assumption of homogeneous item
discrimination; these are also the assumptions commonly associated
with the Rasch model (cf. Gustafsson, 1977; Hambleton et al., 1977).
It has been concluded, however, that the assumption of
unidimensionality and the assumption of local statistical
independence are either identical, or inseparable, and also
that it is difficult to uphold any clear distinction between
the assumption of homogeneous item discrimination and the
assumption of unidimensionality.

It does seem that the Rasch model assumptions can be violated
in basically two ways: either a model is needed to describe
the data which contains two or more parameters for each person,
which would be a violation of the assumption of unidimensionality;
or a model is needed which contains two or more parameters for
each item, which would be a violation of the assumption of the
form of the ICC's; or, of course, a combination of these.

If the Rasch model holds true for a set of data the item parameters
are invariant from one group of persons to another and the person
parameters are invariant from one group of items to another. But
if more than one parameter is needed for each person, such as is
the case in the Lumsden model, for example, the person parameters
will not be invariant for groups of items. If more than one
parameter is needed for each item, such as is the case in the Birnbaum
model, for example, the item parameters will not be invariant for
groups of persons. This forms the basic rationale for the
statistical methods of investigating fit to the Rasch model. The
statistical tests will be taken up later on, after the basics of the
conditional maximum likelihood approach to estimating the item
parameters in the Rasch model have been presented.

2. The conditional approach to the Rasch model

There are several different approaches, ranging in mathematical
and statistical sophistication, to the problem of estimating
the parameters in the Rasch model from a set of observational
data. There are, for example, simple methods suited for hand
calculations (e.g. Wright & Douglas, 1975). But these methods 
introduce further assumptions, such as an assumption of normality 
of the distribution of person parameters, and these methods are 
only approximate. When the user of the model has access to a 
computer, better methods of estimation become available. 



The most commonly used methods of estimation are based on the
maximum likelihood approach. However, two entirely different
maximum likelihood estimators have been defined for the item
parameters in the Rasch model. One is what is called the
unconditional maximum likelihood (UML) approach, in which the
item parameters and the person parameters are estimated
simultaneously (Wright & Panchapakesan, 1969; Wright & Douglas,
1977). The other is what is called the conditional maximum
likelihood (CML) approach, in which the likelihood function for
estimating the item parameters is expressed in the item
parameters only, through conditioning on raw score (Andersen, 1973a;
Fischer, 1974; Gustafsson, 1977). Only in the Rasch model is this
possible, because raw score is a sufficient statistic for the
person parameter.

Only the CML estimator yields consistent estimates of the item 
parameters (Andersen, 1973a; Fischer, 1974), but the UML estimator 
is the one most commonly used (Wright & Douglas, 1977). There 
are two reasons why the theoretically inferior UML estimator has 
been used instead of the CML estimator. In the first place, the 
CML estimates are computationally more cumbersome than the UML 
estimates and they have even been impossible to compute for 
anything but short tests. Secondly, it has been shown (Wright 
& Douglas, 1977) that if a correction is made of the UML estimates, 
they come close to the CML estimates. 

If the similarity between the estimates obtained with the UML
and CML approaches was the only issue in the choice between the
two approaches, there would be little reason to use the CML
approach. There is, however, another, more important difference.
On the basis of the CML approach it is possible to devise
efficient statistical tests of fit with known statistical
properties, while under the UML approach only approximate statistical
tests have been formulated.

The computational problems in relation to the CML algorithm have
recently been solved (Gustafsson, 1977, 1979) so that now the
CML estimates can be obtained for long tests (80-100 items, say)
as well, and most often with a relatively limited amount of
computational work. Therefore only the CML approach will be
considered in detail in the remainder of this paper.



The mathematical notation becomes greatly simplified if an
antilogarithmic transformation is made of the parameters in
the Rasch model such that $\theta_v = \exp(\xi_v)$ and
$\varepsilon_i = \exp(-\sigma_i)$. The probability of observing
the outcome $A_{vi} = a_{vi}$ can then be written:

(2.1)  $P(A_{vi} = a_{vi} \mid \theta_v, \varepsilon_i) = \dfrac{(\theta_v \varepsilon_i)^{a_{vi}}}{1 + \theta_v \varepsilon_i}$

We want to estimate the parameters from the answers of n
persons to k items, and assemble the scores into the matrix
$((a_{vi}))$. The raw score for person v is:

(2.2)  $r_v = \sum_{i=1}^{k} a_{vi}$

and the total number of correct responses to item i (the item
score) is:

(2.3)  $s_i = \sum_{v=1}^{n} a_{vi}$

Those persons who have 0 or k correct answers must be excluded
from the $((a_{vi}))$ matrix since no estimates of their parameters
can be obtained, and items with 0 or n correct answers must be
excluded for the same reason.
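
The bookkeeping in (2.2) and (2.3), together with the exclusion of extreme persons and items, can be sketched as follows. The function name `trim_data` and the small data matrix are hypothetical. Repeating the exclusion until nothing changes is an assumption of this sketch (removing an item can leave a person with an extreme score on the remaining items); the paper only states that extreme persons and items must be excluded.

```python
def trim_data(matrix):
    """Drop persons scoring 0 or k and items scoring 0 or n, repeating until stable."""
    rows = [list(r) for r in matrix]
    while True:
        k = len(rows[0]) if rows else 0
        kept_rows = [r for r in rows if 0 < sum(r) < k]
        n = len(kept_rows)
        if n == 0:
            return []
        keep_cols = [i for i in range(k) if 0 < sum(r[i] for r in kept_rows) < n]
        new_rows = [[r[i] for i in keep_cols] for r in kept_rows]
        if new_rows == rows:
            return rows
        rows = new_rows

data = [
    [1, 1, 1, 1],  # all correct: excluded
    [0, 0, 0, 0],  # none correct: excluded
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]
trimmed = trim_data(data)
raw_scores = [sum(row) for row in trimmed]          # r_v in (2.2)
item_scores = [sum(col) for col in zip(*trimmed)]   # s_i in (2.3)
print(trimmed, raw_scores, item_scores)
```

In this example the first item is answered correctly by every remaining person, so it is dropped as well once the two extreme persons have been removed.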



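Consider first a given examinee with raw score $r_v$ and person
parameter $\theta_v$. Given a set of items with parameters
$(\varepsilon_i)$, the probability that this examinee obtains any
score vector $(a_{vi})$, assuming independence of the responses, is:

(2.4)  $P\{(a_{vi}) \mid \theta_v, (\varepsilon_i)\} = \prod_{i=1}^{k} \dfrac{(\theta_v \varepsilon_i)^{a_{vi}}}{1 + \theta_v \varepsilon_i} = \dfrac{\theta_v^{r_v} \prod_{i=1}^{k} \varepsilon_i^{a_{vi}}}{\prod_{i=1}^{k} (1 + \theta_v \varepsilon_i)}$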

To be able to express this probability as a conditional
probability, given score $r_v$, we must know the probability of
obtaining score $r_v$ given $\theta_v$. This latter probability is
given by the sum of the probabilities of all possible ways of
obtaining the score $r_v$, that is the sum of all the expressions
such as (2.4) in which the vector $(a_{vi})$ sums up to r.

A given score r obtained on k items can of course be obtained
in $\binom{k}{r}$ different ways. We will need a special notation to be
able to express this in a simple way. Define:

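(2.5)  $\gamma_r\{(\varepsilon_i)\} = \sum_{\sum_i a_{vi} = r} \prod_{i=1}^{k} \varepsilon_i^{a_{vi}}$

The $\gamma_r\{(\varepsilon_i)\}$ (or, for short, $\gamma_r$) is called the elementary
symmetric function of order r in the parameters $(\varepsilon_i)$.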
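
To make definition (2.5) concrete, the sketch below computes the elementary symmetric functions in two ways: directly from the definition, and with the standard recursion that adds one item at a time. The function names and the three item parameters are hypothetical, and the recursion shown is the textbook one, not necessarily the algorithm of Gustafsson (1977, 1979).

```python
from itertools import combinations
from math import prod

def esf_bruteforce(eps, r):
    """gamma_r as the sum over all item subsets of size r of the product of their parameters."""
    return sum(prod(subset) for subset in combinations(eps, r))

def esf_all(eps):
    """gamma_0..gamma_k, built up by adding one item at a time."""
    gamma = [1.0] + [0.0] * len(eps)
    for j, e in enumerate(eps, start=1):
        # Go downwards so that gamma[r - 1] still refers to the previous item set.
        for r in range(j, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

eps = [0.5, 1.0, 2.0]  # hypothetical item parameters
print(esf_all(eps))
print([esf_bruteforce(eps, r) for r in range(len(eps) + 1)])
```

The brute-force version visits all $\binom{k}{r}$ subsets and is only feasible for short tests, whereas the recursion needs on the order of k squared multiplications.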

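We can now write the probability of obtaining the score r,
given $\theta_v$ and $(\varepsilon_i)$:

(2.6)  $P\{r \mid \theta_v, (\varepsilon_i)\} = \sum_{\sum_i a_{vi} = r} \prod_{i=1}^{k} \dfrac{(\theta_v \varepsilon_i)^{a_{vi}}}{1 + \theta_v \varepsilon_i} = \dfrac{\theta_v^{r} \gamma_r}{\prod_{i=1}^{k} (1 + \theta_v \varepsilon_i)}$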



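Thus the conditional probability of obtaining any vector $(a_{vi})$
with the total score $r_v$, given the score $r_v$, is:

(2.7)  $P\{(a_{vi}) \mid r, (\varepsilon_i)\} = \dfrac{P\{(a_{vi}) \mid \theta_v, (\varepsilon_i)\}}{P\{r \mid \theta_v, (\varepsilon_i)\}} = \dfrac{\prod_{i=1}^{k} \varepsilon_i^{a_{vi}}}{\gamma_r}$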
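
The key property of the conditional probability, namely that the person parameter cancels, can be checked numerically. The sketch below divides the pattern probability by the score probability by brute force for two very different person parameters; the function names and parameter values are hypothetical.

```python
from itertools import product
from math import prod

def pattern_prob(pattern, theta, eps):
    """Probability of a response pattern given theta and the item parameters."""
    num = prod((theta * e) ** a for a, e in zip(pattern, eps))
    den = prod(1.0 + theta * e for e in eps)
    return num / den

def conditional_prob(pattern, theta, eps):
    """P(pattern | raw score): pattern probability divided by the score probability."""
    r = sum(pattern)
    same_score = [p for p in product((0, 1), repeat=len(eps)) if sum(p) == r]
    return pattern_prob(pattern, theta, eps) / sum(pattern_prob(p, theta, eps) for p in same_score)

eps = [0.5, 1.0, 2.0]   # hypothetical item parameters
pattern = (1, 0, 1)
print(conditional_prob(pattern, 0.2, eps))
print(conditional_prob(pattern, 5.0, eps))
```

Both calls give the same value, the product of the parameters of the items answered correctly divided by the elementary symmetric function of order 2, which is why the item parameters can be estimated without reference to the person parameters.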

If independence is assumed between examinees, the conditional
likelihood of the data matrix $((a_{vi}))$ is easily obtained. The
logarithm of the likelihood function can be shown (Fischer,
1974; Gustafsson, 1977; Wright & Douglas, 1977) to be:

(2.8)  $\log \Lambda = \sum_{i=1}^{k} s_i \log \varepsilon_i - \sum_{r=1}^{k-1} n_r \log \gamma_r$

where $n_r$ is the number of persons with raw score r.
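
A direct transcription of the conditional log-likelihood into code might look as follows. This is a sketch under stated assumptions: the function names are hypothetical, the data matrix is invented, and persons with extreme scores are assumed to have been removed beforehand.

```python
import math

def esf_all(eps):
    """Elementary symmetric functions gamma_0..gamma_k of the item parameters."""
    gamma = [1.0] + [0.0] * len(eps)
    for j, e in enumerate(eps, start=1):
        for r in range(j, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

def conditional_loglik(matrix, eps):
    """Conditional log-likelihood for a 0/1 matrix with persons in rows."""
    k = len(eps)
    gamma = esf_all(eps)
    item_scores = [sum(col) for col in zip(*matrix)]   # s_i
    n_r = [0] * (k + 1)                                # persons per raw score
    for row in matrix:
        n_r[sum(row)] += 1
    first = sum(s * math.log(e) for s, e in zip(item_scores, eps))
    second = sum(n_r[r] * math.log(gamma[r]) for r in range(1, k))
    return first - second

data = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 1]]   # hypothetical responses
print(conditional_loglik(data, [1.0, 1.0, 1.0]))
print(conditional_loglik(data, [2.0, 2.0, 2.0]))
```

The two calls return the same value: multiplying all item parameters by a common constant leaves the conditional likelihood unchanged, so a normalization of the item parameters is needed before they can be estimated.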

Estimation equations for the item parameters can be derived
from (2.8) (Fischer, 1974; Gustafsson, 1977; Wright & Douglas,
1977). The greatest problem in solving the equations, which
must be done iteratively, lies in efficiently and accurately
computing the $\gamma_r$ and their first derivatives with respect to
each of the items ($\gamma_{r-1}^{(i)}$) and sometimes also their second
derivatives with respect to the items two at a time
($\gamma_{r-2}^{(i,j)}$). However, as was shown by Gustafsson (1977, 1979)
it is possible to devise recursive formulas that can manage
these tasks.
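
The first derivatives of the symmetric functions have a simple interpretation: the derivative of the function of order r with respect to one item parameter is the symmetric function of order r-1 computed with that item left out. The sketch below checks this numerically. It uses a plain leave-one-out recomputation for illustration only and is not the recursive scheme of Gustafsson (1977, 1979); all names and values are hypothetical.

```python
def esf_all(eps):
    """gamma_0..gamma_k by adding one item at a time."""
    gamma = [1.0] + [0.0] * len(eps)
    for j, e in enumerate(eps, start=1):
        for r in range(j, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

def esf_without(eps, i):
    """Symmetric functions of order 0..k-1 with item i left out."""
    return esf_all(eps[:i] + eps[i + 1:])

eps = [0.5, 1.0, 2.0]        # hypothetical item parameters
gamma = esf_all(eps)
first = esf_without(eps, 0)  # first[r - 1] is the derivative of gamma_r w.r.t. eps[0]

# gamma_r is linear in eps[0], so a finite difference recovers the derivative.
h = 1e-6
bumped = esf_all([eps[0] + h] + eps[1:])
for r in range(1, len(eps) + 1):
    numeric = (bumped[r] - gamma[r]) / h
    print(r, first[r - 1], round(numeric, 6))
```

Recomputing the functions once per item costs roughly k times as much as a single pass, which gives some idea of why efficient and accurate recursions matter for long tests.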

It is not necessary to treat methods for estimating the person 
parameters since it is possible to avoid estimation of these 
in evaluations of fit. This is quite fortunate since the 
statistically correct method of estimating the person parameters 
is impossible to apply in practical work (Fischer, 1974, pp. 239-240).

3. Goodness-of-fit tests for the Rasch model

All the goodness-of-fit tests are based on the principle that
implications of the model assumptions are tested against
observable results. But there are several implications of the
model assumptions and the tests can technically and statistically
be constructed in many different ways, so there are
several goodness-of-fit tests for the Rasch model.

Rasch (1960, 1966) showed that it is possible to devise a
test of the model in which no use is made of estimated
parameters. This test, which is a generalization of Fisher's
exact test for a 2x2 matrix, is, however, computationally so
cumbersome that it has as yet proven impossible to put it
into practical use. Therefore, all the goodness-of-fit tests
in practical use employ estimates of parameters in the model,
and tests based on the UML- and CML-approaches differ greatly.

Wright and Panchapakesan (1969) developed, within the framework
of the UML-approach, a test of overall fit and a test of item
fit based on comparisons between observed and theoretically
expected frequencies of correct answers to each item at
different levels of ability. Mead (1976a, 1976b) extended
this approach into a method based on analysis of residuals in
the fitted model, using analysis of variance procedures and
plots of the residuals. This procedure allows detection of
different types of deviation from the model, such as guessing,
speededness and learning effects.



The distributions of the test-statistics formulated within 
the framework of the UML-approach are unknown, however. The 
chi-square and z-distributions have been relied upon, but 
simulation studies indicate that even though the means of 
the distributions conform to the expected ones, the variances 
may depart substantially (Mead, 1976b).

Within the framework of the CML-approach goodness-of-fit tests
have been formulated (Andersen, 1973b; Martin-Löf, 1973) which
have at least asymptotically known distributions, and which have
been shown to be parametric counterparts to Fisher's exact test
(Martin-Löf, 1974b). These tests are presented below.

Tests sensitive to variations in the ICC's 

It has already been concluded that if a set of data fit the
Rasch model, the item parameters (or the ICC's) will be
invariant for groups of persons. Andersen (1973b; cf. Martin-Löf,
1973) has presented a conditional likelihood ratio test
of model fit from this starting point.

To compute this test the item parameters are estimated using
the total sample of persons, and also within g disjoint
subgroups of persons with $n_j$ (j=1,...,g) persons in each. In
each estimation of the item parameters a maximum of the
logarithm of the likelihood (2.8) is obtained. We can call the
maximum obtained for the total group of persons $H$ and the
maxima obtained for the subgroups $H_j$ (j=1,...,g). The following
test statistic can then be written:

(3.1)  $\log \lambda = H - \sum_{j=1}^{g} H_j$

Andersen (1973b) has shown that $-2 \log \lambda$ is asymptotically
chi-square distributed with (g-1)(k-1) degrees of freedom
when each $n_j \to \infty$.
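
Once the maximised conditional log-likelihoods are available, the test statistic and its degrees of freedom follow by simple arithmetic, as the sketch below shows. The function name `andersen_lr` and the numerical values of the maxima are invented for the example; obtaining the maxima themselves requires a CML estimation routine that is not shown here.

```python
def andersen_lr(h_total, h_groups, k):
    """Return (-2 log lambda, degrees of freedom) from maximised conditional log-likelihoods."""
    g = len(h_groups)
    statistic = -2.0 * (h_total - sum(h_groups))
    df = (g - 1) * (k - 1)
    return statistic, df

# Hypothetical maxima: total sample and two subgroups, for a test of 10 items.
stat, df = andersen_lr(-1250.0, [-600.0, -641.5], k=10)
print(stat, df)
```

Since the subgroup estimates are free to differ from the total-group estimates, the sum of the subgroup maxima cannot be smaller than the total-group maximum, so the statistic is never negative. The value would then be referred to a chi-square distribution with the computed degrees of freedom.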



This test is sensitive to differences in the ICC's for different
groups of persons and will therefore be referred to as the A-ICC
test. However, the persons can be grouped according to different
criteria, and depending upon how the grouping is done the test
is sensitive to different violations of the model assumptions.
One possibility is to group persons according to level of
performance on the test, i.e. according to raw score. When used in
this way the test is sensitive to variations in the slopes of
the ICC's, i.e. it guards against the alternative hypothesis
that the Birnbaum model, or a model with even more parameters
for each item, would be needed to describe the data (cf. Andersen,
1973b). We will use a special name for the test in this important
kind of application: the A-ICCSL test, with the suffix SL chosen
to indicate that the test investigates the homogeneity of the
slopes of the ICC's.

But the persons can also be grouped according to other criteria 
such as sex, social background, or school, just to mention a 
few. When used in this way the A-ICC test is a test of uni- 
dimensionality since it follows directly from the definition 
of unidimensionality that the ICC for an item must be invariant 
for groups of persons. This holds true in particular when the
grouping is not confounded with level of performance, since
otherwise the test would also be sensitive to variations in the
slopes of the ICC's.

Martin-Löf (1973, pp. 128-129) has suggested another test which
is sensitive to variations in the slopes of the ICC's and it
will be referred to as the ML-ICCSL test. This test is
asymptotically equivalent to the A-ICCSL test but it is of quite a
different construction. In the ML-ICCSL test the item parameters
are only estimated for the total group, and the test is computed
from the differences between observed and predicted frequencies
of correct answers for persons with different raw scores (score
groups).

Let $n_{ir}$ denote the observed frequency of correct answers to
item i for those persons who have r correct answers. A
corresponding predicted frequency can also be determined: the conditional
probability that a person with raw score r answers item i correctly
can easily be shown to be $\varepsilon_i \gamma_{r-1}^{(i)} / \gamma_r$.
Therefore, if the model is true for the data, the following
relationship should hold approximately true:

(3.2)  $n_{ir} \approx n_r \dfrac{\varepsilon_i \gamma_{r-1}^{(i)}}{\gamma_r}$

The ML-ICCSL test takes as its starting point this relationship
and from the deviations between observed and predicted
frequencies a chi-square sum is built up.
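
The predicted frequencies can be computed directly from the elementary symmetric functions. In the sketch below, the conditional probability of a correct answer given the raw score is the item parameter times the symmetric function of order r-1 with that item left out, divided by the symmetric function of order r. The function names, item parameters and group sizes are hypothetical.

```python
def esf_all(eps):
    """Elementary symmetric functions gamma_0..gamma_k."""
    gamma = [1.0] + [0.0] * len(eps)
    for j, e in enumerate(eps, start=1):
        for r in range(j, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

def prob_correct_given_score(eps, i, r):
    """P(item i correct | raw score r) under the Rasch model."""
    gamma = esf_all(eps)
    gamma_i = esf_all(eps[:i] + eps[i + 1:])   # item i left out
    return eps[i] * gamma_i[r - 1] / gamma[r]

eps = [0.5, 1.0, 2.0]    # hypothetical item parameters
n_r = {1: 40, 2: 60}     # hypothetical numbers of persons per score group
for r, n in n_r.items():
    predicted = [n * prob_correct_given_score(eps, i, r) for i in range(len(eps))]
    print(r, [round(x, 2) for x in predicted], round(sum(predicted), 2))
```

Within a score group the conditional probabilities add up to r over the items, so the predicted frequencies add up to r times the group size, as the observed frequencies necessarily do.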

If we label the vector of observed frequencies
$(n_{1r}, \ldots, n_{kr})' = (q_r)$ and call the corresponding
vector of predicted frequencies
$(n_r \varepsilon_1 \gamma_{r-1}^{(1)} / \gamma_r, \ldots, n_r \varepsilon_k \gamma_{r-1}^{(k)} / \gamma_r)' = (t_r)$,
the test statistic is:

(3.3)  $T = \sum_{r=1}^{k-1} \{(q_r) - (t_r)\}' \{((V_r))\}^{-1} \{(q_r) - (t_r)\}$

in which quadratic form $((V_r))$ is a variance-covariance matrix
of order kxk with elements defined as follows:

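(3.4)  $n_r \left[ \dfrac{\varepsilon_i \gamma_{r-1}^{(i)}}{\gamma_r} - \left( \dfrac{\varepsilon_i \gamma_{r-1}^{(i)}}{\gamma_r} \right)^2 \right]$  in the diagonal

       $n_r \left[ \dfrac{\varepsilon_i \varepsilon_j \gamma_{r-2}^{(i,j)}}{\gamma_r} - \dfrac{\varepsilon_i \gamma_{r-1}^{(i)} \, \varepsilon_j \gamma_{r-1}^{(j)}}{\gamma_r^2} \right]$  off the diagonal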

Martin-Löf (1973) has shown that the test statistic is
asymptotically chi-square distributed with (k-1)(k-2) degrees of
freedom when each $n_r \to \infty$.

In (3.3) the summation is made over all score groups. If, however,
some $n_r = 0$ we have to restrict the summation to those R groups
in which $n_r > 0$. The degrees of freedom then are (k-1)(R-1).

When k is large, the test is quite tedious to compute since it 
requires computation of k-1 matrix inversions as well as the 
second derivatives of the symmetric functions. It can be noted, 
however, that the actual inversion of the matrices can be avoided: 
Scheffe (1959) has shown that the quadratic form can be computed 
by evaluating two determinants instead, which requires less 
computational work. 

The ICCSL tests give information about the homogeneity of the
slopes of the ICC's, but they do not give any information of
value concerning the reasons for poor fit. Due to the lack of
a statistical test of item fit with a known distribution under
the CML-approach, graphical methods have been resorted to. This
is no great sacrifice, however, since the logic of testing the
fit of single items can be questioned (see section 7 below), and since
descriptive information is needed more than anything else.

The relationship (3.2) can be rewritten so that it expresses a 
relationship between proportions of correct answers, instead 
o£ frequencies. If, for a fixed item, the observed proportion 
is plotted against the predicted proportion, the points for the 
score groups should fall along a straight line with a slope 
of unity, even though the points as a function of stochastic 
variation will be spread around the line of unit slope. This 
graphical test will be referred to as the GR-ICCSL test, since 
it is sensitive to variations in the slope of the ICC's.

The plots that are observed in applications of the model tend
to have many different appearances. However, three different types
account for the absolute majority of the patterns observed.
The first is where the points actually fall close to the line
of unit slope, and this indicates fit to the model. The second
type of pattern appears when the observed proportion of correct
answers for the lower score groups is higher than the predicted
proportion, while at the same time, the observed proportion is
lower than the predicted proportion for the higher score groups.
A low discrimination parameter would be found for such an item
if the Birnbaum model was applied. The third pattern, finally,
appears when the observed proportion for the lower score groups
is lower than the predicted one and when the observed proportion
for the higher score groups is higher than the predicted one,
and it reflects the case when the item has too high a
discrimination parameter.
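
The three patterns can be summarized numerically by the slope of a line fitted to the plotted points. The following is a minimal sketch, not part of the GR-ICCSL procedure itself: the least-squares slope and the tolerance of 0.15 are assumptions chosen for illustration, and the proportions are invented.

```python
def fitted_slope(predicted, observed):
    """Least-squares slope of observed on predicted proportions."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    sxy = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    sxx = sum((p - mean_p) ** 2 for p in predicted)
    return sxy / sxx


def describe_pattern(predicted, observed, tolerance=0.15):
    """Classify a plot by comparing its slope with the line of unit slope."""
    slope = fitted_slope(predicted, observed)
    if slope < 1.0 - tolerance:
        return "too low discrimination"   # second pattern
    if slope > 1.0 + tolerance:
        return "too high discrimination"  # third pattern
    return "fit"                          # first pattern


# Predicted proportions for four score groups, lowest group first.
predicted = [0.2, 0.4, 0.6, 0.8]
flat = [0.35, 0.45, 0.55, 0.65]    # low groups above, high groups below
steep = [0.05, 0.35, 0.65, 0.95]   # low groups below, high groups above
```

Such a summary does not replace inspection of the plot, since stochastic variation around the line is not taken into account.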

Test sensitive to variations in the PCC's

The tests presented above all investigate the invariance of
the item parameters for groups of persons and they will therefore
be referred to as ICC-tests. Practically all other tests
of fit which have been used also belong to the group of ICC-tests
and in particular to the sub-group of ICCSL-tests. It is
easily shown, however, that there may be violations of the
assumptions of the Rasch model which cannot be detected with
these tests. Lumsden (1978), for example, showed that the PCC's
may be non-parallel, while the ICC's are parallel.

To investigate the hypothesis that the Lumsden model, or another 
model with more than one person parameter is in fact needed to 
represent the observations, one could study the invariance of 
the person parameters for groups of items. However, a test 
constructed straightforwardly from this point of departure 
would have less than optimal characteristics, since a very large 
number of parameters would have to be estimated, and since it is 
practically impossible to estimate the abilities conditionally 
on item score. 

It is, however, not necessary to estimate the abilities to perform
the test. A conditional likelihood ratio test, founded
on the CML estimates of the item parameters, which tests the
hypothesis that two groups of items measure the same ability
has been presented by Martin-Löf (1973, pp. 135-136; cf. Leunbach,
1976).

To compute the test it is necessary that the items be grouped
into two disjoint sets. Let us say that there are $k_1$ and $k_2$
items in the two sets, respectively, and that $k_1 + k_2 = k$. Furthermore,
let $n_{r_1 r_2}$ be the number of persons with raw score $r_1$ on
the first set and raw score $r_2$ on the second set. When the item
parameters for the total set of $k$ items are estimated, the maximum
of the logarithm of the likelihood function (2.8) is obtained
($H_t$), and when the item parameters are estimated for each set
separately, the corresponding maxima $H_1$ and $H_2$ are obtained.
The following test statistic can then be formed:

$$\log \Lambda = \sum_{r_1=0}^{k_1} \sum_{r_2=0}^{k_2} n_{r_1 r_2} \log \frac{n}{n_{r_1 r_2}} + \sum_{r=0}^{k} n_r \log \frac{n_r}{n} + H_t - H_1 - H_2 \qquad (3.5)$$

Martin-Löf (1973) has shown that $-2 \log \Lambda$ is approximately
chi-square distributed with $k_1 k_2 - 1$ degrees of freedom when $n \to \infty$.
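
Once the three maxima of the conditional log-likelihood are available, the statistic is simple to assemble. The sketch below assumes that the maxima H_t, H_1 and H_2 have been obtained elsewhere by CML estimation, that n is the total number of persons, and that empty cells contribute nothing to the sums; the function name is hypothetical.

```python
import math


def martin_lof_pcc(table, h_total, h_first, h_second):
    """Return (-2 log Lambda, degrees of freedom) for a two-set item split.

    table[r1][r2] is the number of persons with raw score r1 on the first
    set and raw score r2 on the second set.
    """
    k1 = len(table) - 1
    k2 = len(table[0]) - 1
    n = sum(sum(row) for row in table)

    # Marginal counts n_r of the total raw score r = r1 + r2.
    totals = [0] * (k1 + k2 + 1)
    for r1, row in enumerate(table):
        for r2, count in enumerate(row):
            totals[r1 + r2] += count

    # Empty cells are skipped (0 * log 0 is taken as 0).
    joint = sum(c * math.log(n / c) for row in table for c in row if c > 0)
    marginal = sum(c * math.log(c / n) for c in totals if c > 0)

    log_lambda = joint + marginal + h_total - h_first - h_second
    return -2.0 * log_lambda, k1 * k2 - 1


# Two sets of one item each: the conditional likelihoods of the single-item
# sets are 1 (log = 0), and for the pooled pair the maximum is reached at
# equal item parameters, giving H_t = 10 * log(1/2) for the ten persons
# with total score 1.
stat, df = martin_lof_pcc([[5, 5], [5, 5]], -10 * math.log(2), 0.0, 0.0)
_, df_6_6 = martin_lof_pcc([[1] * 7 for _ in range(7)], 0.0, 0.0, 0.0)
```

In the two-item case the statistic is zero and there are no degrees of freedom, as expected when both models reproduce the data equally well; two sets of six items give 35 degrees of freedom.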

The test can be applied with the items grouped according to
different principles, and depending upon how the items are
grouped the test will be sensitive to different violations of 
the assumptions. One possibility is to group the items according 
to item score, i.e. difficulty. Then the test investigates the 
hypothesis that a model of the Lumsden type would be needed to 
account for the observations, i.e. that the person sensitivity 
parameters differ. In this special kind of application the test 
will be referred to as the ML-PCCSL test, since it tests the 
homogeneity of the slopes of the PCC's. 

But the test can also be applied with the items grouped according
to different hypothesized dimensions. In this kind of application
the test is, of course, a direct test of unidimensionality, and
when used in this way it will be referred to as the ML-PCC test.

It should be pointed out that the test will also be sensitive
to a difference in the mean value of the discrimination parameter
for the two sets of items. Within the sets of items the
discriminations can vary, however, without this being detected by the test
as long as the mean discrimination is the same.

With the possibility of varying person sensitivity parameters in
mind, the question of person fit to the model arises.
Under the CML approach it is at least theoretically simple to
construct a test of person fit.

An expression has already been derived for the probability
of obtaining any given score vector, given a certain raw score
(2.7). A p-value is obtained if the probability of all more
extreme score vectors, i.e. those with a lower or the same
conditional probability of being observed, is summed up.
Unfortunately this test is computationally cumbersome since even with
few items the total number of possible score vectors is very
large.
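
For a very short test the exact p-value can be obtained by brute-force enumeration. The sketch below assumes that the item parameters are given in multiplicative form, so that the conditional probability of a score vector given the raw score r is the product of the parameters of the correctly answered items divided by the symmetric function of order r. The parameter values are invented.

```python
from itertools import combinations


def exact_person_fit(epsilons, response):
    """Sum the conditional probabilities of all score vectors with the same
    raw score that are no more probable than the observed vector."""
    k = len(epsilons)
    r = sum(response)

    def weight(correct_items):
        w = 1.0
        for i in correct_items:
            w *= epsilons[i]
        return w

    # One weight per score vector with raw score r.
    weights = [weight(items) for items in combinations(range(k), r)]
    gamma_r = sum(weights)  # elementary symmetric function of order r
    observed = weight([i for i, x in enumerate(response) if x == 1])
    tolerance = 1e-12 * gamma_r  # guards against rounding in ties
    return sum(w for w in weights if w <= observed + tolerance) / gamma_r


# One easy item (parameter 4) and two harder items (parameter 1).
p_unexpected = exact_person_fit([4.0, 1.0, 1.0], [0, 1, 0])
p_expected = exact_person_fit([4.0, 1.0, 1.0], [1, 0, 0])
```

The enumeration grows quickly: with 30 items and a raw score of 15 there are already 155,117,520 score vectors to evaluate.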

A computationally more feasible test can be constructed if the
items are grouped into sets. Consider the case when only two
sets of items are used, with $k_1$ and $k_2$ items. Let us, for any
given person, denote the raw score on the first set $r_1$ and the
raw score on the second set $r_2$, with $r_1 + r_2 = r$. Denote further
the symmetric functions of the corresponding orders in the item
parameters, estimated with both sets pooled, as $\gamma_{r_1:1}$, $\gamma_{r_2:2}$
and $\gamma_r$, respectively. It is then easily shown (cf. Leunbach, 1976) that
the conditional probability of obtaining the raw scores $r_1$ and
$r_2$ is:

$$P(r_1 \mid r) = \frac{\gamma_{r_1:1} \, \gamma_{r_2:2}}{\gamma_r} \qquad (3.5)$$


A p-value for the fit of the person is obtained if the probabili- 
ties of all equally or more extreme combinations of raw scores on 
the groups of items are summed up. 
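
The computation for two sets of items can be sketched as follows. The symmetric functions are built up with the usual recurrence over items; the item parameters are assumed to be in multiplicative form, and the function names and numerical values are hypothetical.

```python
def symmetric_functions(epsilons):
    """Elementary symmetric functions of orders 0..k of the parameters."""
    gamma = [1.0]
    for eps in epsilons:
        shifted = [0.0] + [eps * g for g in gamma]
        gamma = [a + b for a, b in zip(gamma + [0.0], shifted)]
    return gamma


def subscore_person_fit(eps_first, eps_second, r1, r2):
    """p-value for the split (r1, r2) of the raw score r = r1 + r2."""
    g1 = symmetric_functions(eps_first)
    g2 = symmetric_functions(eps_second)
    r = r1 + r2
    # Every split of r that is possible given the sizes of the two sets.
    lowest = max(0, r - len(eps_second))
    highest = min(r, len(eps_first))
    weights = {s1: g1[s1] * g2[r - s1] for s1 in range(lowest, highest + 1)}
    gamma_r = sum(weights.values())  # symmetric function of the pooled set
    observed = weights[r1]
    tolerance = 1e-12 * gamma_r
    return sum(w for w in weights.values() if w <= observed + tolerance) / gamma_r


# Two sets of two equally difficult items, total raw score 2.
p_lopsided = subscore_person_fit([1.0, 1.0], [1.0, 1.0], 0, 2)
p_balanced = subscore_person_fit([1.0, 1.0], [1.0, 1.0], 1, 1)
```

Only as many terms as there are possible values of the first raw score have to be evaluated, which is what makes this version feasible.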

A test of this kind is easy to compute. It can be suspected to 
have a low power, however, and the power would also be very 
different for different raw scores if the same grouping of 
items is used. Power can be increased, however, if the test is
generalized to more than two groups of items and if a different
grouping is used for each raw score.






It must be pointed out that a test like this cannot be applied 
to all the persons in a sample, since the significance level
would then be seriously disturbed. Only when a single randomly 
chosen person is observed does a statistical test of person 
fit have any meaning. 

4. Sources of deviation from the Rasch model, and how they
are detected

In the previous section we have seen how it is possible to 
devise statistical tests of the fit of data to the Rasch model, 
either through investigating the invariance of item parameters 
for groups of persons or through investigating the invariance 
of person parameters for groups of items. Both these groups of 
tests, the ICC- and PCC-tests, are tests of unidimensionality
but they are not equally sensitive to different deviations and 
a deviation that may be detected with one test may be impossible 
to detect with another test. 

There are a number of identifiable sources of threat against
the Rasch model, and it is of course of great interest to 
clarify which statistical tests are needed to detect different 
types of deviations. Such sources of deviation are discussed 
below. 

Item heterogeneity 

Item heterogeneity, in the sense that different groups of 
items measure different abilities, is of course a violation
of the assumption of unidimensionality. 

As long as there is some basis for an a priori grouping of 
the items according to different hypothesized dimensions the 
most straightforward way to investigate this kind of deviation 
is to use the ML-PCC test. This is also the method to be 
recommended, but we shall first see if there are other methods 
with which item heterogeneity can be detected.

In presentations of the Rasch model (e.g. Gustafsson, 1977)
it has been implied that item heterogeneity can be detected
with ICCSL tests. As long as the items measuring different
abilities have different discrimination parameters the ICCSL
tests do in fact detect item heterogeneity, but it is of course
conceivable that there are no detectable differences in the
slopes of the ICC's for the different groups of items, in which
case an ICCSL test would not detect multidimensionality.

That this may be the case was shown with generated data by
Gustafsson and Lindblad (1978; cf. Brink, 1970). They demonstrated
that the A-ICCSL test did not reject the Rasch model
even for data generated according to an orthogonal 2-factor 
model, which in that case was due to the fact that every item 
related in the same way to a composite of the two latent 
variables involved. Of course, if this test and the other ICCSL 
tests fail to detect multidimensionality in generated data, it 
is also possible that they may fail to do so with empirical 
data. 

An example will be presented to show that this is not just a 
highly unlikely possibility, but that it may actually happen 
in reality. Muthen (1978) analyzed, as an illustration of a 
newly developed method for factor analysis of dichotomous 
items, 15 items in a questionnaire assessing the personality 
variable internal-external locus of control (Rotter, 1966).
There were data for 391 persons. The factor analysis showed
that there were three weakly correlated factors among the 15
items.

The fit of these data to the Rasch model has been investigated
with the A-ICCSL test, and a very good fit was found ($\chi^2 = 22.4$,
df=28, p<.76). Since there is no reason to distrust the factor
analysis it seems that the A-ICCSL test in this case is not a 
test of the unidimensionality of the items in the questionnaire. 

Additional support for this conclusion is obtained if the data
are also analyzed with the ML-PCC test, with the items grouped
into three scales according to their highest loading in the
factor analysis. There were 6, 6 and 3 items in the scales.
Applying the ML-PCC test to two scales at a time, the following
results were obtained: 1 vs 2: $\chi^2 = 123.8$, df=35, p<.00; 1 vs 3:
$\chi^2 = 42.3$, df=17, p<.00; 2 vs 3: $\chi^2 = 60.0$, df=17, p<.00. These results
show clearly that the three scales measure different
dimensions.
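
The degrees of freedom and tail probabilities reported above can be checked with a few lines of code. The sketch uses the closed-form upper-tail probability of the chi-square distribution for integer degrees of freedom; the function name is hypothetical and only the figures quoted in the text are used as input.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        # exp(-y) * sum_{j=0}^{df/2 - 1} y^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= y / j
            total += term
        return math.exp(-y) * total
    # Odd df: start from df = 1 and add one term per two extra df.
    total = math.erfc(math.sqrt(y))
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-y) * y ** (j - 0.5) / math.gamma(j + 0.5)
    return total


# Degrees of freedom k1 * k2 - 1 for scales with 6, 6 and 3 items.
scale_sizes = {1: 6, 2: 6, 3: 3}
df_1_2 = scale_sizes[1] * scale_sizes[2] - 1
df_1_3 = scale_sizes[1] * scale_sizes[3] - 1
df_2_3 = scale_sizes[2] * scale_sizes[3] - 1

p_iccsl = chi2_sf(22.4, 28)        # A-ICCSL test on all 15 items
p_1_2 = chi2_sf(123.8, df_1_2)     # ML-PCC tests, two scales at a time
p_1_3 = chi2_sf(42.3, df_1_3)
p_2_3 = chi2_sf(60.0, df_2_3)
```

The degrees of freedom agree with those reported (35, 17 and 17), the tail probability for the A-ICCSL test comes out at about .76, and all three ML-PCC probabilities fall below .001.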

In this case we must draw the conclusion that the ICCSL-tests
are not sensitive to multidimensionality among the items, and
a warning must be issued against accepting fit to the model, as
shown by an ICCSL test, as evidence of unidimensionality.

To test this kind of multidimensionality in a more proper way 
within the framework of the Rasch model, the ML-PCC test should 
be used. That test, however, is a confirmatory test which 
requires that the items be grouped into sub-sets before any 
analysis is performed, and often the prior information is too 
weak to provide an adequate basis for this. In these cases, it 
does seem necessary to use factor analysis to obtain information 
about the dimensionality of the observations. 

It is well known that factor analysis of dichotomous items has 
many problems, both when phi-coefficients are used (Ferguson, 
1941) and when tetrachoric correlations are used (Gourlay, 1951; 
Lord & Novick, 1968, p. 349). Factor analytic methods specially
designed for dichotomous items have, however, recently been
developed (Christoffersson, 1975; Muthen, 1978). Statistically
these methods are attractive but they involve great computational
complexities, which at present limit their usefulness
to smaller sets of items (less than 20, say; Muthen, 1978).

However, even though there are still unsolved problems in
factor analysis of dichotomous items, the factor analytic
methods are likely to give much information about the
dimensionality and grouping of the items that is impossible to obtain
in any other way. It should also be pointed out that even quite
imperfect factor analytic methods can be used, since the results
are checked with the ML-PCC test. Thus, for example, if a factor
analysis of phi-coefficients has produced "difficulty" factors
(Ferguson, 1941) these can be detected with the ML-PCC test.



Item bias 



If certain groups of items are systematically too easy or too 
difficult for certain sub-groups of the sample, this represents 
a special case of item heterogeneity which is referred to as 
item bias. An example of item bias may be that certain items
favor the boys in a sample, while certain other items favor the
girls.

Item bias can be detected in two ways. One possibility is to 
use the ML-PCC test, with the items grouped into internally
homogeneous scales which are supposed to give different 
"profiles" of performance level in different groups. The other 
possibility is to use the A-ICC test, with the sample of persons 
divided into groups, such as boys and girls. 

Speededness 

Speededness of the test is obviously a violation of the model 
assumptions since if a person does not have time to attempt an 
item, any statement about the probability of a correct answer 
as a function of ability is meaningless. 

In a speeded test the items early and late in the test measure 
different abilities as long as "speed" and "power" are not 
perfectly correlated, so speededness can be detected with the 
ML-PCC test, if a proper grouping of the items is used.

Speededness can also be detected with the ICCSL tests. Persons
with low raw scores do not even attempt the items late in the test,
so those items will appear to have too high a discrimination (cf.
Mead, 1976a, p. 9).

Guessing 

If guessing takes place, which is particularly likely when
multiple-choice items with few response alternatives are used, the ICC
cannot be represented with one parameter only; a model like the
3-parameter model (Birnbaum, 1968) is needed to represent such
data adequately.

Guessing can be detected with the ICCSL-tests if the items
are of unequal difficulty and the persons have unequal ability:
too many low-ability persons will answer the difficult items
correctly, whereby they obtain too high a raw score, which in
turn implies that on the easier items where the proportion of
guesses is smaller, the low-ability persons will appear to
perform too poorly. The easier items will thus appear to have
too high a discrimination and the more difficult items will
appear to have too low a discrimination.

Mead (1976b, p. 96) showed that guessing also affects the 
apparent value of the person sensitivity parameters, so 
guessing can also be detected with the ML-PCCSL test. 

Non-independence of responses 

The assumption of local statistical independence implies that 
the response made by a person to an item must be independent 
of the responses to the other items in the test. This assumption 
can be violated in several different ways, such as by learning 
effects and by constrained responses. If, for example, four 
responses are derived from a question requiring the pairing 
with respect to meaning of four given English words with four 
given Swedish words, those of the examinees who know three of 
the answers will automatically have their fourth answer correct 
as well. Or, to take another example, if the answer given on 
one item affects the answer given on another item, the assump- 
tion of local statistical independence will be violated. 
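
The constraint in the word-pairing example can be verified by enumeration. This is a small sketch assuming that each examinee's answer is one of the 24 possible one-to-one pairings of the four English words with the four Swedish words.

```python
from collections import Counter
from itertools import permutations

# Each permutation is one way of pairing four English words with four
# Swedish words; the pair in position i is correct when perm[i] == i.
counts = Counter(
    sum(1 for i, j in enumerate(perm) if i == j)
    for perm in permutations(range(4))
)
```

No pairing has exactly three correct answers: `counts[3]` is 0 while `counts[4]` is 1, so the fourth response is fully determined by the other three and the four responses cannot be locally independent.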

As has already been pointed out, the assumption of local 
statistical independence is equivalent to the assumption of 
unidimensionality, and non-independence of responses can be
detected with the ML-PCC test, if the items thought to be 
affected by such non-independence are grouped into one group, 
and the other items grouped into another group. 

Heterogeneous item discrimination 

The ICCSL-tests are by definition sensitive to variations in
the discrimination of the items, so this kind of deviation
from the Rasch model can easily be detected.

As has already been pointed out, it is, however, quite difficult 
to differentiate between violations of the assumption of uni- 
dimensionality and violations of the assumption of homogeneous 
item discrimination. This question was discussed at a rather 
abstract level in section 1, and here a few more comments will 
be made in relation to a concrete example. 

Gustafsson (1977, pp. 63-69) analyzed an inductive reasoning 
test composed of number series items and found that two items 
gave evidence of too low a discrimination. The quite obvious 
explanation was that these items posed a much higher demand 
for arithmetical skills than did the other items. 

The poor fit of these items was interpreted as being due to
multidimensionality, which is reasonable according to any
definition of unidimensionality. For example, in a factor analysis
the two items might define a factor of their own. However,
had there been only one item of that kind the item set would
have been unidimensional according to the Lord & Novick
definition of unidimensionality, with a large item-specific
component for the item posing high demands for arithmetical
skills.

This illustrates the very blurred line of distinction between 
multidimensionality and heterogeneous item discrimination and 
that models which allow the item discriminations to vary do 
not easily allow generalization beyond the specific set of 
items analyzed. 

Heterogeneous person sensitivity 

Lumsden (1977, 1978) drew attention to the fact that person 
reliabilities may differ. If that is the case, the person 
sensitivity parameter in the Lumsden model (1.4) would be 
different for different persons, which is a violation of the 
assumptions of the Rasch model. 






This kind of threat against the Rasch model has not been studied
at all, but the possibility of varying person reliability must
be taken seriously, both for practical and for theoretical reasons.

In principle, heterogeneous person sensitivity parameters can be 
detected with the ML-PCCSL test, but this is probably not the 
best way to study this kind of phenomenon. The test is likely 
to have a low power only, and it is sensitive to many other 
sources of threat as well. Furthermore, it is not likely that 
it will ever be possible to estimate the person sensitivity 
parameters, so not very much is gained by only knowing that 
they differ. 

A better approach may be to try to find another variable,
correlated with the person sensitivity parameters, and to use
the A-ICC test with the sample grouped according to level of 
performance on this other variable. If the level of the person 
sensitivity parameters differs between the groups, it will be 
found that the item parameters are not invariant over the 
groups (cf. Lumsden, 1978). Such an approach would allow a 
more powerful test of the hypothesis, and a proxy for the 
person sensitivity parameters would be available. An important 
problem is of course what variables are likely to be related
to intra-individual variability, but it does seem that personality
variables are useful; Rankin (1963), for example, found
that the reliability of reading tests was higher for introverts
than for extraverts.

Should it be found that the person sensitivity parameters 
in ordinary applications do show a substantial variation, this 
would imply great problems from the point of view of the Rasch 
model, since it would not be possible to use the same model 
for all persons. Such a finding could be quite useful from a 
prediction point of view, however, within the framework of 
moderated regression (e.g. Ghiselli, 1965).



Discussion 




A rather long list of possible sources of deviation from the
Rasch model has been compiled, and no doubt the list could
be made even longer. However, two important conclusions emerge.
The first conclusion is that it is possible, in principle at
least, to detect the deviations from the Rasch model, even
though at times an active search is necessary. The other
conclusion is that the ICCSL tests do not suffice to make a
complete evaluation of fit. Nevertheless, tests sensitive to
variations in the slopes of the ICC's are those that have been 
primarily used, and if such a test has not shown a poor fit, 
this has been taken as an adequate overall fit of the data. The 
ML-PCC test, which has never been used before, is, however, a 
necessary complement to such tests. 

5. Problems in the use of statistical tests to assess fit 

From the discussion above, the reader may have gained the 
impression that statistical tests can be used without any 
problems as long as they are in principle sensitive to a 
certain deviation. This is, of course, not so. In fact, the 
use of statistical tests is fraught with several problems, 
of which it is necessary to be aware. 

Very large samples form a special source of problems. This 
is because no model can ever be supposed to be perfectly 
fitted by data, so with a sufficiently large sample any model 
would have to be discarded. In connection with this problem 
Martin-Löf (1974a) stated:

This indicates that for large sets of data it is too 
destructive to let an ordinary significance test decide 
whether or not to accept a proposed statistical model, 
because, with few exceptions, we know that we shall have 
to reject it even without looking at the data simply 
because the number of observations is so large. In such 
cases, we need instead a quantitative measure of the size 
of the discrepancy between the statistical model and the 
observed set of data... (p. 3).
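
The point made in the quotation can be illustrated with a worked example: for a fixed discrepancy between observed and expected proportions, a chi-square statistic grows in direct proportion to the number of observations. The sketch uses an ordinary Pearson statistic and invented proportions, not any of the Rasch tests discussed here.

```python
def pearson_chi2(n, observed_props, expected_props):
    """Pearson chi-square for n cases with the given cell proportions."""
    return sum(
        n * (o - e) ** 2 / e for o, e in zip(observed_props, expected_props)
    )


expected = [0.25, 0.25, 0.25, 0.25]
observed = [0.26, 0.24, 0.25, 0.25]  # a small, fixed discrepancy

small_sample = pearson_chi2(500, observed, expected)
large_sample = pearson_chi2(50000, observed, expected)
```

With 500 cases the statistic is 0.4 and with 50 000 cases it is 40, on 3 degrees of freedom in both instances (the .05 critical value is 7.81). The same discrepancy is thus accepted in the small sample and rejected in the large one.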

Martin-Löf (1974a) derived such a measure, called redundancy,
from concepts in statistical information theory, which on
an absolute scale measures the deviation between a statistical
model and a set of data. This measure can thus be used when
the fit of a large set of data is investigated, even though
it does not seem very useful until there are tens of thousands
of cases (Gustafsson, 1977, pp. 57-61), at least not for short
tests.

Another way to come to grips with the problems caused by large
sets of data is to replace the inferential methods with descriptive
methods, based on graphical descriptions of the deviations.
With some experience it is thus quite easy to use the GR-ICCSL
test to judge the magnitude of the variations in the
slopes of the ICC's.

Problems are also caused by samples that are too small. Thus, 
the statistical tests are only asymptotically chi-square 
distributed, so with samples that are too small there is a 
risk that the test statistic does not have the distribution
assumed. 

It has been argued that the A-ICC test requires a large sample
to be applied with confidence (Mead, 1976b, p. 34; Hambleton
et al., 1977, p. 63). Some preliminary simulation studies
indicate, however, that the asymptotic properties of this test
apply reasonably well already with as few as 50-100 persons
within each group (Gustafsson, 1977, pp. 54-55). The ML-ICCSL
test, however, does not enjoy as good properties in this respect
as does the A-ICCSL test. This is because the former test uses
the results for each score group, while in the A-ICCSL test small
score groups are pooled; therefore, the asymptotic properties
come into force for much smaller samples for the A-ICCSL test
than for the ML-ICCSL test. It does seem wise to be cautious in
interpreting the results from the ML-ICCSL test when any score
group contains fewer than 10 observations, say. (Empty score groups
cause no problems, however.)

A greater problem caused by small samples is that the power
of the test may be too low to detect even sizeable deviations
from the Rasch model. Since the power of the tests is a function
of a large number of factors, it seems impossible to give any
generally valid rules for the sample sizes needed to detect
deviations of different sizes. However, to give some general
information about the power of the tests and to study the effects
on power of different factors, some simulation studies have been
performed.

Study I: The power of the A-ICCSL test against heterogeneous
item discrimination



Many of the deviations from the Rasch model appear as varying 
item discrimination, so it is important to have at least some 
rough information about the power of the ICCSL-tests against 
this kind of deviation. Only small samples of persons will be
used, so only the A-ICCSL test will be studied.

In the simulations the following factors were varied: 

Number of items: 15 and 30.

Test design: One set of "peaked" tests and one set of "spaced"
tests were simulated. In the peaked tests all items had a
difficulty of zero on the log scale. The spaced tests contained
the difficulties -2, -1, 0, 1 and 2, with three items at each
level of difficulty in the 15-item tests, and with 6 items at
each level of difficulty in the 30-item tests.

Amount of deviation: To simulate a small amount of deviation,
the discrimination parameters 0.8, 1.0 and 1.2 were used, with
each discrimination parameter being represented by the same
number of items at all levels of difficulty. To simulate a large
deviation from the model, the discrimination parameters 0.5, 1.0
and 1.5 were used (cf. Hambleton & Traub, 1971).

Sample size: 150 and 300.

Standard deviation (SD) of person parameters: The person
parameters were sampled from two normal distributions with zero
means, one with a small SD of .71 and one with a large SD of
1.22.

For each combination of levels of these factors 100 sets of
data were generated according to the Birnbaum model, using the
feedback shift register generator (Lewis & Payne, 1973) as the
basic generator 3). The data were analyzed with the A-ICCSL
test, with the score groups grouped in such a way that the
parameters were practically always estimated within two roughly
equal-sized sub-groups. In some cases it was impossible to
compute the test (cf. Gustafsson, 1977, p. 49), so for some
combinations the results are based on a lower number of
replications than 100.
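
The generation of one data set in this design can be sketched as follows. Several details are assumptions rather than a reproduction of the original procedure: Python's built-in Mersenne Twister is used in place of the feedback shift register generator, the logistic function is used without a scaling constant, and the function name and seed are hypothetical.

```python
import math
import random


def simulate_birnbaum(n_persons, difficulties, discriminations, sd, seed=0):
    """Generate a persons-by-items matrix of 0/1 responses under the
    two-parameter logistic model with normally distributed abilities."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, sd)  # person parameter, mean zero
        row = []
        for b, a in zip(difficulties, discriminations):
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


# Spaced 15-item test with a large deviation: three items at each of five
# difficulty levels, one item per discrimination value at every level.
difficulties = [b for b in (-2, -1, 0, 1, 2) for _ in range(3)]
discriminations = [0.5, 1.0, 1.5] * 5
data = simulate_birnbaum(150, difficulties, discriminations, sd=1.22)
```

Repeating the call with 100 different seeds and analyzing each data set with the fit test would give one cell of the power table.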

The percentage of replications in which the p-value of the
test was lower than .05 is shown in Table 1 for all the
combinations of levels of the factors.

All the factors studied affect power. The sample size and
number of items tend to influence power in the same way, at
least for the peaked test, which shows that the number of
responses analyzed is important.

Deviations are more easily detected in a peaked test than in 
a spaced test. This is because the amount of information in 
a response is at a maximum when the probability of a correct 
answer is .50 and in a spaced test there are fewer such 
occurrences than in a peaked test. Had the mean of the distri- 
bution of person parameters been varied as well, a lower power 
would have been found when the mean of ability differs from 
the mean difficulty of the test. 
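
The difference between the two test designs can be made concrete with the information function of a response, which under the Rasch model is p(1 - p) and therefore largest at p = .50. The sketch below compares the 15-item peaked and spaced tests for a person at the mean of the ability distribution; the function name is hypothetical.

```python
import math


def rasch_information(theta, difficulty):
    """Information in one response under the Rasch model: p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
    return p * (1.0 - p)


peaked = [0.0] * 15
spaced = [b for b in (-2, -1, 0, 1, 2) for _ in range(3)]

info_peaked = sum(rasch_information(0.0, b) for b in peaked)
info_spaced = sum(rasch_information(0.0, b) for b in spaced)
```

For this person the peaked test yields 15 x .25 = 3.75 units of information, while the spaced test yields less, since twelve of its items are answered with a probability away from .50.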

The SD of the person parameters strongly affects power. In 
fact, when the SD is zero, the test has no power whatsoever 
against this type of deviation (cf. Wright, 1977b). That this 
is the case is not always realized; Wood (1978), for example,
reported that the Rasch model fits random data, and seemed
surprised at the finding.

When a large amount of deviation is present in the data, the 
test provides an adequate power in almost all instances. The 
most notable exception to this is when the factors combine 
most unfavorably, i.e. a short and spaced test, a low SD and
a small sample.

When there is a small to moderate amount of deviation, the
power is adequate only in the most favorable combination of
levels on the factors.

One should hesitate to draw any general conclusions on the 
basis of a study as limited in scope as this one. It does 
appear, however, that as long as the SD of ability is not too 
low (around 1.0, say) and the difficulty of the test is 
adequate for the sample, a reasonable power to detect moderate 
heterogeneity of the item discriminations is obtained when 
10 000-20 000 responses are analyzed. 

Study II: The power of the A-ICCSL test against guessing

If guessing is a factor affecting performance, this tends to
affect the apparent value of the discrimination parameter in
the Birnbaum model. For easy items a high discrimination is observed,
and for difficult items a low discrimination is observed. Some
simulations have been performed to study the power of the A-ICCSL
test to guard against this type of deviation from the Rasch model.

It would make little sense to run simulations on peaked
tests when guessing is the threat; the test has power only
when there is some variation in the item difficulties. Therefore,
only spaced tests, designed in the same way as in Study I, were
included.

Only one amount of deviation was studied: all items were supposed
to have a value of .20 on the guessing parameter in the 3-parameter
model (Birnbaum, 1968), and all the discrimination parameters
were assumed to be unity.

Except for these changes in the design, the study was carried out 
in the same way as Study I, using the same levels on the other 
factors, except, of course, that the data were generated according 
to the 3-parameter model. 
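
The response function used in this study can be sketched as below, with the guessing parameter acting as a lower asymptote. The logistic form without a scaling constant is an assumption, as are the function name and the ability and difficulty values in the example.

```python
import math


def three_parameter_probability(theta, difficulty, discrimination=1.0,
                                guessing=0.20):
    """Probability of a correct answer with a lower asymptote for guessing."""
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return guessing + (1.0 - guessing) * logistic


# A low-ability person (theta = -2) meeting a difficult item (b = 2).
p_without_guessing = three_parameter_probability(-2.0, 2.0, guessing=0.0)
p_with_guessing = three_parameter_probability(-2.0, 2.0)
```

Without guessing this person almost never answers the item correctly, whereas with a guessing parameter of .20 the probability stays above .20. It is this excess of correct answers on the difficult items that flattens their apparent discrimination.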

The results are presented in Table 2. Again, all the factors
affect power and they do so, of course, in the same way as was
found in Study I. It is found, however, that in no case is the
power adequate for the 15-item test, and only with a high SD
and a sample of 300 persons is the power large enough for the
30-item test.



Comparing the figures presented for the spaced test in Table 1 
with those presented in Table 2, it is found that with a guessing 
parameter of .20 the effect on the apparent discrimination is 
somewhat larger than what was labelled a small variation in the 
item discriminations. It would seem, however, that here too 
some 10 000-20 000 responses would be needed to detect presence 
of guessing of this amount, granted that the SD is not low and 
that there is a substantial variation in the item difficulties. 



Study III: The power of the A-ICCSL test against both heterogeneous
item discrimination and guessing

Only rarely can it be suspected that there is only one kind of
deviation from the Rasch model in the data. To study the power
of the A-ICCSL test against two sources of deviation, a study
was performed in which both guessing and varying item
discrimination were present.

The same design as in Study II was used, except that the 
discrimination of the items was also varied, using the 
discrimination parameters 0.5, 1.0 and 1.5, with each discri- 
mination parameter being represented by the same number of 
items at all levels of difficulty. 

The data were generated according to the 3-parameter model and 
again 100 replications were used. 

The results are presented in Table 3. As compared with Study II 
the power is higher, as would be expected from the fact that 
another sizeable deviation has been introduced. But comparing 
the results with those obtained in Study I, when a large amount
of deviation was simulated for a spaced test, a lower power is
found when guessing is also introduced. This is because these
two kinds of deviation partly cancel out: the easy items with
too low a discrimination and the difficult items with too high
a discrimination obtain a more "normal" discrimination as a
consequence of the guessing.

Examples could easily be constructed in which the effects of 
two deviations on the discriminations cancel out completely, 
resulting in no power whatsoever of the test to discover any 
of them. Of course, it is also possible for different deviations 
to work in the same direction, so that the deviations magnify 
each other. 



Discussion 



The simulation studies presented here indicate that the A-ICCSL 
test should be sufficiently powerful against alternative models 
of the 2- and 3-parameter type if samples of 500-1 000 persons 
are used and if the tests contain about 20-40 items. It must 
be kept in mind, however, that the SD of ability is a factor 
critically affecting power, as is the range of item difficulties 
when guessing is present. 

The possibility of trading relationships between different
violations must be taken seriously. Using a goodness-of-fit test
only, it is not possible to decide whether there are one or more
deviations from the model, so this information must be taken
from other sources. For some types of possible deviations this
is not difficult. It should be possible to judge from the item
type whether or not a substantial amount of guessing is present,
and if a test is speeded, there tends to be a large amount of
omitted responses for the items late in the test. If such sources
of deviation can be identified it should be seriously considered
whether any goodness-of-fit test should be carried out at all; it
is already known that the Rasch model cannot be expected to fit
the observations, and there is a risk that there will be trading
relationships between those deviations and others not so easily
detected.

When the problem of too large samples was discussed, it was
suggested that descriptions of deviations using graphical methods
should be used. This recommendation also applies when there is
a risk that the sample is too small; a deviation impossible to
detect with a powerless statistical test may be possible to
detect with a graphical test.

Only the power of the A-ICCSL test has been investigated
here, and similar investigations could be carried out for the
other tests. It is not expected, however, that very different
conclusions would be arrived at. Thus, the ML-PCC test seems
quite powerful when "normal" samples are used, as long as there
is some variation in the abilities measured by the different
groups of items.

6. Evaluating fit in different measurement contexts

The question of fit is of course not an absolute one and it is
quite obvious that the purpose for which the model is used should
decide how to treat the goodness-of-fit problem.

It does seem possible to make a distinction between two major
classes of application of the Rasch model into which the
goodness-of-fit problem enters differently. In the first of these
the Rasch model is used as a criterion, against which
characteristics of the observations themselves are evaluated. This
kind of application is based on the fact that the Rasch model
formalizes desirable characteristics of measurements (i.e.
unidimensionality and sufficiency of raw score as an estimator of
ability, cf. Gustafsson, 1977; Wright, 1977b), and fit to the
model is used to draw inferences that the observations in fact
enjoy these desirable characteristics.

In the second class of applications, the Rasch model is used as
an instrument to solve one or more practical measurement problems,
such as linking and equating tests, carrying out tailored testing,
optimizing tests, constructing item banks and so on (e.g. Wright,
1977a). In this kind of application, the solution of a practical
measurement problem is the main objective, and the characteristics
of the observations themselves are important only to the extent
that they help or prevent the achievement of the end.

Evaluating fit when the Rasch model is used as a criterion 

It is fairly commonly accepted that in work with a theoretical
orientation the scales into which observations are assembled
should be homogeneous (e.g. Lord & Novick, 1968, p. 351). As
was pointed out by Lumsden (1976), the notion of
unidimensionality has, however, been seriously neglected both by
constructors of tests and by test theorists.

The unidimensionality assumption of the Rasch model, along
with the availability of goodness-of-fit tests, makes this
model useful, in principle at least, in investigations of the
unidimensionality of sets of observations.



It can be noted, though, that doubts have been expressed as to
the possibility of using the Rasch model as a criterion of
unidimensionality. Speaking primarily about the Rasch model
and the normal ogive model, Wood (1976) stated:

These item response models seem to be remarkably elastic 
concerning the motley collections of items they will fit 
(Wood, 1976, pp. 258-259). 

And: 

It looks as if, by one means or another, heterogeneous
collections of items can be made to fit response models
even though inspection strongly suggests that the items
are not congruent, as where groups of items call on
psychologically distinguishable processes... (Wood, 1976,
p. 260).

The background of these statements is almost certainly that 
incomplete evaluations of fit have been made. For the Rasch 
model only ICCSL tests have, no doubt, been employed, and it 
has already been shown that such tests may fail to detect even 
serious violations of the assumption of unidimensionality. 




Thus, it is obvious that when the Rasch model is used as a
criterion, high standards of fit must be set, and it is
necessary that several tests which each guard against different
deviations from the model are applied. In addition to ICCSL
tests, the ML-PCC test should have a central place in this
kind of application, since the latter test forms a direct
test of unidimensionality.

This test, however, requires that the items are grouped into
subsets before it is computed, which implies that information
about the dimensionality of the observations must be taken from
other sources. It has already been suggested that factor analysis
is useful in this context, but information derived from a
careful scrutiny of the items and from observations of solution
processes is also likely to be useful (cf. Cronbach, 1970, pp.
474-475).

When the Rasch model is used as a criterion, the power of the 
statistical tests is essential. Whenever it is suspected that 
the power is insufficient, a closer look at the problem should 
be taken, perhaps through conducting a specially designed 
simulation study in which the characteristics of that particular 
situation are represented. 

The use of the Rasch model as a criterion is above all of 
interest in work with a theoretical orientation. This implies 
that the items must be constructed from theoretical starting 
points, and these theoretical notions should also direct the 
evaluation of fit. In such work the Rasch model is also likely 
to prove useful to test specific hypotheses about test items, 
without it being regarded as a failure if the model is rejected.

It is true that most test construction is essentially atheoretical 
and, as has been pointed out by Levy (1973), there is only a weak 
relation between test theory and psychological theory: 

Statistical manipulation of test results is sometimes used 
as a poor substitute for operational control of item content 
and format at the test development stage. Much needed are 
tests constructed to test hypotheses, and fewer hypotheses 
about tests (Levy, 1973, p. 37). 

This state of affairs is not likely to change as a function of
adoption of the Rasch model. Should, however, a greater
theoretical sophistication come about among test constructors,
it is likely that the Rasch model will be found to contribute to
their work.

Whitely and Dawis (1974) spoke in a similar vein:

...the lack of impact of the Rasch model in test development
is due more to the current status of trait measurement
than to the properties of the model (p. 77).

Evaluating fit when the Rasch model is used as an instrument

The Rasch model can be used as an instrument to solve a range 
of practical measurement problems (e.g. Wright, 1977a). Here 
too, the fit of the data to the model is important, but the 
question of fit is nevertheless subordinated to the solution 
of concrete measurement problems. This implies that lower 
standards of fit can sometimes be set, that all possible 
deviations from the model assumptions need not necessarily be 
considered, and that in fact large deviations in the data from 
the model assumptions can sometimes be tolerated. 

If it is known that a set of data fit the Rasch model, it
follows from the mathematical structure of the model itself
that it can be used to solve practical measurement problems.
The reason, however, why the Rasch model sometimes might be
used as an instrument, in spite of poor fit, is that deviations
from the model do not necessarily jeopardize applications.
Unfortunately, very little is known about the robustness of the 
Rasch model against different types of deviations for different 
types of applications, and this is an area where much research 
is needed. 

Some research has been carried out, though, and it may be
instructive to consider some of that in greater detail to see
how the goodness-of-fit problem can be handled when the Rasch
model is used as an instrument.

The area of application where most research on the robustness
of the Rasch model has been carried out is the equating of
tests, i.e. expressing on the same scale raw scores obtained
on different tests. In principle, this problem is easily solved
with the Rasch model through first estimating the item parameters
for the two tests on a common scale, and then deriving the
ability scales which specify the conversion of raw scores into
estimates of ability (e.g. Wright, 1977a; Rentz & Bashaw, 1977).
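
The conversion of raw scores into estimates of ability can be
illustrated with a short Python sketch. Under the Rasch model the
maximum likelihood estimate of ability for a raw score r is the
ability at which the expected score on the test equals r; the
sketch solves that equation by Newton-Raphson iteration. The
function name and the item difficulties in the example are
hypothetical, and the sketch is not claimed to be the procedure of
the authors cited above.

```python
import math


def ability_from_raw_score(r, difficulties, tol=1e-10, max_iter=100):
    """ML ability estimate for raw score r, given Rasch item difficulties."""
    k = len(difficulties)
    if not 0 < r < k:
        raise ValueError("no finite estimate exists for zero or perfect scores")
    # Starting value: logit of the proportion correct, shifted by mean difficulty.
    theta = math.log(r / (k - r)) + sum(difficulties) / k
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        expected = sum(p)                             # expected raw score at theta
        information = sum(pi * (1.0 - pi) for pi in p)
        step = (r - expected) / information
        theta += step
        if abs(step) < tol:
            break
    return theta


# Two hypothetical tests whose difficulties are on a common scale:
easy_test = [-2.0, -1.5, -1.0, -0.5, 0.0]
hard_test = [0.0, 0.5, 1.0, 1.5, 2.0]
for r in range(1, 5):
    print(r, round(ability_from_raw_score(r, easy_test), 2),
          round(ability_from_raw_score(r, hard_test), 2))
```

Once the difficulties of both tests are expressed on one scale,
each test gets its own conversion table, and the same raw score
corresponds to a higher ability on the more difficult test.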

It has been shown (e.g. Wright, 1968; Whitely & Dawis, 1974)
that estimates of the mean of ability for a group derived from
easy and difficult items in a test come quite close. In those
studies the data did not fit the model, which indicates that
the estimates of ability are quite robust against deviations
from the model.

However, Slinde and Linn (1978) argued that it should also be
possible in vertical equating of tests (i.e. equating tests of
different difficulty) to use the item parameters estimated in
any group of persons to estimate the abilities in any other
group of persons. They compared the means of ability estimates
obtained from easy and difficult tests for groups of different
levels of ability, using item parameters estimated either within
the same group of persons, or estimated within a group of persons
of another level of ability. It was found that a reasonably good
vertical equating could be achieved when the item parameters
estimated within the groups were used, but not when item
parameters estimated within another group were used. On the basis
of these results, Slinde and Linn (1978) questioned the
usefulness of the Rasch model in solving the problem of vertical
equating of tests.

It should be pointed out that a partial explanation of the 
poor results obtained by Slinde and Linn (1978) is that they 
used an illegal grouping of the sample into levels of ability;
they used performance on a subset of the items only as the 
basis for the grouping, a procedure which introduces a spurious 
lack of fit even when the data fit the model (Gustafsson, 1979b). 
However, Slinde and Linn (1979) have presented another study 
which allowed very much the same conclusion. 

The Slinde and Linn requirement that it should be possible to
use the estimates of parameters from any group of persons is
a reasonable one, since in some cases this is necessary in
equating tests. It does seem rash, however, to draw a general
conclusion about the inability of the Rasch model to solve the
problem of vertical equating on the basis of a few empirical
studies alone, and without supplying any reasons for the failure.

Slinde and Linn (1978, 1979) suggested that an LT model which
allows the slopes of the ICCs to be different might be needed
to solve the problem of vertical equating. Of course, in the
presence of heterogeneous item discriminations the item
parameters will always differ when estimated within groups of
different level of performance, but depending upon the exact
kind of violation of the assumption of homogeneous item
discrimination, the biasing effects in vertical equating will
be different.

Some simple simulation studies have been performed to illustrate
this. Data were generated to follow the Birnbaum model for three
tests with 60 items in each, 30 of which had the difficulty -1
and 30 of which had the difficulty 1. In one of the tests the
item discriminations were not correlated with difficulty, there
being 10 items each with discrimination parameters 0.5, 1.0 and
1.5 at each of the levels of difficulty. This test will be
referred to as the ZCORR test. In another test there were 10
items with discrimination 1.0 and 20 items with discrimination
1.5 among the easy items; among the difficult items there were
10 items with discrimination 1.0 and 20 items with
discrimination 0.5. This test will be referred to as the NCORR
test, since it simulates the case when there is a negative
correlation between discrimination and difficulty, such as is the
case when the test items allow guessing. Finally, in the third
test (PCORR) the frequencies of items with high and low
discriminations were reversed at the two levels of difficulty as
compared with the NCORR test, to simulate a test with a positive
correlation between discrimination and difficulty, such as tends
to be the case for a speeded test.

For each of these three tests data were generated for 1 000 
persons, with the ability parameters being sampled from a normal 
distribution with zero mean and unit standard deviation. Persons 
with a score equal to 30 or lower formed a "low" group, and the 
rest of the sample formed a "high" group. The item parameters 
for the total set of 60 items were estimated within the two
groups, and the mean of ability for each group was estimated 
separately for the easy and difficult items, using the item 
parameters estimated both within the same group and the other 
group of persons. 
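
The construction of the three tests and the grouping of persons
can be sketched in Python as follows. The sketch covers the data
generation and the split into a "low" and a "high" group only;
the estimation of item parameters within each group is omitted.
It assumes that "the Birnbaum model" here means the 2-parameter
logistic model without guessing and without the scaling constant
1.7, and the function names and the random seed are invented for
the illustration.

```python
import math
import random


def build_test(easy, difficult):
    """easy / difficult map a discrimination value to a number of items."""
    items = []
    for b, spec in ((-1.0, easy), (1.0, difficult)):
        for a, n in spec.items():
            items.extend([(b, a)] * n)
    return items


TESTS = {
    "ZCORR": build_test({0.5: 10, 1.0: 10, 1.5: 10}, {0.5: 10, 1.0: 10, 1.5: 10}),
    "NCORR": build_test({1.0: 10, 1.5: 20}, {1.0: 10, 0.5: 20}),
    "PCORR": build_test({1.0: 10, 0.5: 20}, {1.0: 10, 1.5: 20}),
}


def simulate_and_split(items, n_persons=1000, cut=30, seed=1):
    """Generate responses for N(0, 1) abilities and split persons by raw score."""
    rng = random.Random(seed)
    low, high = [], []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        resp = [int(rng.random() < 1.0 / (1.0 + math.exp(-a * (theta - b))))
                for b, a in items]
        # A score of 30 or lower places the person in the "low" group.
        (low if sum(resp) <= cut else high).append(resp)
    return low, high


groups = {name: simulate_and_split(items) for name, items in TESTS.items()}
for name, (low, high) in groups.items():
    print(name, len(low), len(high))
```

Note that the grouping uses the score on all 60 items, not on a
subset of them.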

Table 4 presents, for the easy and difficult items separately,
the difference between the means obtained when using the item
parameters estimated within the same group and those estimated
within the other group. It could of course be argued that the
differences between means obtained on easy and difficult items
should be presented instead, since the problem of vertical
equating is studied. In this case, however, these are not
directly comparable, since some of the persons in the high group
had a perfect score on the easy items and some of the persons in
the low group had a zero score on the difficult items.

The figures presented in Table 4 show that for the ZCORR test
only small differences are found when the estimates of ability
are based on item parameters estimated within groups of
different levels of ability. For the PCORR and NCORR tests,
however, there is a large bias, with the direction of bias being
different depending upon the sign of the correlation between item
difficulty and item discrimination.

Using figures presented by Slinde and Linn (1979), the
corresponding differences have been computed for that study. The
pattern of differences found coincides with that found for
the NCORR test, as might be expected from the fact that the
test analyzed by Slinde and Linn (1979) was a multiple-choice
test heavily influenced by guessing.
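
The comparison of patterns can be checked mechanically from the
figures in Table 4. The short Python sketch below is only an
illustration; the values are transcribed from Table 4 in the order
easy items (low, high) and difficult items (low, high), and the
helper name is invented here.

```python
# Differences transcribed from Table 4:
# [easy/low, easy/high, difficult/low, difficult/high]
table4 = {
    "ZCORR": [-0.02, 0.14, -0.13, 0.02],
    "PCORR": [-0.45, 0.50, 0.40, -0.44],
    "NCORR": [0.39, -0.36, -0.44, 0.40],
    "Slinde & Linn (1979)": [0.62, -0.30, -0.48, 0.62],
}


def same_sign_pattern(x, y):
    """True if the two columns agree in sign in every cell."""
    return all((a > 0) == (b > 0) for a, b in zip(x, y))


reference = table4["Slinde & Linn (1979)"]
for name in ("ZCORR", "PCORR", "NCORR"):
    print(name, same_sign_pattern(table4[name], reference))

# Mean absolute difference per test, as a rough index of the size of the bias.
mean_abs = {name: sum(abs(v) for v in col) / len(col) for name, col in table4.items()}
print(mean_abs)
```

Only the NCORR column agrees in sign with the Slinde and Linn
column in all four cells, and the ZCORR column has by far the
smallest mean absolute difference.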

This brief analysis thus makes it likely that the negative
conclusions drawn by Slinde and Linn as to the possibility of
using the Rasch model as an instrument in the vertical equating
of tests were due to a negative correlation between item
difficulty and item discrimination in that study. Had a test with
the same amount of deviation but with a zero correlation between
difficulty and discrimination been analyzed, a much more positive
conclusion would have been arrived at.

It must be stressed that this analysis of the robustness of the
Rasch model is very limited in scope and allows very limited
generalizations only. Thus, attention has been confined to the
estimates of the mean of ability for groups of persons, but it
is well known that in the presence of heterogeneous item
discrimination the Rasch model is less efficient than other LT
models (Hambleton & Traub, 1971; Reckase, 1978). Parenthetically,
it should also be pointed out that Andersen and Madsen (1977) have
recently presented a superior solution to the problem of
estimating the parameters of the latent population distribution.
The robustness of that method against deviations from the Rasch
model assumptions remains as yet to be studied.

The purpose of this digression has been to show that the Rasch 
model sometimes is quite robust against deviations from the 
model assumptions, while at other times it is not robust at all. 
This suggests that when the Rasch model is to be used as an 
instrument on data not fitting the model, the deviations from
the model should first be analyzed and described, and it should 
then be investigated whether the model is robust against these 
deviations for the particular application intended. 

Of course, the Rasch model is best used as an instrument when
the data fit the model. It should therefore also always be
investigated whether it is possible to obtain fit of data to the
model. Strategies for doing this are discussed in the next section.

7. Obtaining fit of data to the Rasch model 

It does appear that, on the whole, a rather simple strategy is
followed to obtain fit of data to the Rasch model. This standard
procedure may be described in the following, somewhat simplified,
way: a set of items is given to a sample of persons and
an overall ICCSL test is computed. If this test is significant,
which is usually the case, the p-value of fit to the model of
each item is computed, or a graphic test of item fit is made,
and those items which do not fit are excluded. A new overall
ICCSL test is then computed, usually with the same sample of
persons, and unless a non-significant value on the test
statistic is obtained, the process is carried out again,
excluding more items, until a reasonably good overall fit is
obtained.
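
To make the logic of this standard procedure explicit, it can be
written as a loop. The Python sketch below is illustrative only:
the two callables that return the p-value of the overall test and
the p-values of the single items are hypothetical placeholders
supplied by the caller, and the stub functions in the example
merely stand in for real tests of fit.

```python
from typing import Callable, Dict, List, Sequence


def exclude_until_fit(
    items: Sequence[str],
    overall_p_value: Callable[[List[str]], float],
    item_p_values: Callable[[List[str]], Dict[str, float]],
    alpha: float = 0.05,
    min_items: int = 2,
) -> List[str]:
    """Repeatedly drop misfitting items until the overall test is non-significant."""
    retained = list(items)
    while len(retained) > min_items:
        if overall_p_value(retained) >= alpha:
            break  # overall fit judged acceptable; stop excluding
        p = item_p_values(retained)
        misfit = [i for i in retained if p[i] < alpha]
        if not misfit:
            # No single item is significant: drop the worst-fitting one.
            misfit = [min(retained, key=lambda i: p[i])]
        retained = [i for i in retained if i not in misfit]
    return retained


# Stub "tests of fit" with fixed, made-up p-values (for illustration only).
FIXED_P = {"a": 0.50, "b": 0.01, "c": 0.40, "d": 0.60}


def stub_item_p(current: List[str]) -> Dict[str, float]:
    return {i: FIXED_P[i] for i in current}


def stub_overall_p(current: List[str]) -> float:
    return min(FIXED_P[i] for i in current)


result = exclude_until_fit(list(FIXED_P), stub_overall_p, stub_item_p)
print(result)
```

The loop stops as soon as the overall test is non-significant, and
every decision in it is taken on the same sample of persons.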

It is submitted here that this strategy is likely to result 
in a spurious fit only, and that it should only rarely be used. 
In view of current practice this is a strong assertion, but 
several reasons can be cited in support of it. 

One reason is of course, as has been shown above, that the
ICCSL tests represent only a partial evaluation of fit to the
model, and they can fail to detect even very serious deviations
from the Rasch model. Other tests, and above all the ML-PCC
test, should therefore also be used to study item heterogeneity.

Another reason why the strategy based on exclusion of items
should not be used is that there may be trading relationships
between different violations of the model assumptions, as was
shown in Study III in section 5. Consider for example a slightly
speeded multiple-choice test with heterogeneous items.
Speededness and guessing tend to affect the discriminations in
opposite directions and item heterogeneity may also affect the
discriminations. It is very likely that a large proportion of the
items in such a test which do show a good fit do this because the
effects of the different violations cancel out. If "poor-fitting"
items are excluded, a good overall fit, as evidenced by an ICCSL
test, will eventually be obtained, but that good fit has been
obtained through capitalizing on such trading relationships, and
on chance effects. When this kind of "fit" has been obtained, the
implications which are otherwise associated with fit of data to
the Rasch model do not hold true.

A third reason why items should not be routinely excluded is that
there may be deviations from the model where other steps should
be taken to obtain fit. If, for example, the main reason for the
poor fit of a set of items is that the examinees have been
given too short a testing time, the best way to obtain fit is
to give the test with a more liberal time limit. Or, to take
another example, if the test consists of multiple-choice items
with a few response alternatives on which the subjects have
been instructed to guess if they do not know the
correct answer, it does not seem wise to select those items
which appear to fit the model in spite of the guessing; instead
the opportunities to guess at all should be minimized if the
Rasch model is to be used.

But there is also a fourth, and even more important, reason
why development of Rasch scales on the basis of exclusion
of poor-fitting items cannot be recommended as a general
strategy. This is because tests of the fit of single items, in
the presence of gross deviations from the model, are in principle
illogical: the basic requirement of the Rasch model is that the
items shall be homogeneous, so what is tested is, in fact, if the
items fit with each other, not if they fit the model. If tests
of item fit indicate that just a few of the items do not fit,
this can of course be interpreted as showing that these items
do not fit with the other items, and hence not the model. But if
a larger proportion of the items show misfit, the item set is
so heterogeneous that there may be subsets of items in the set,
each of which shows a good fit to the model, but which do not
fit with each other.

Suppose, for example, that a set of items all measure the same
ability but that they have different discrimination parameters
(which is, of course, a highly hypothetical situation). If items
are excluded on the basis of tests of item fit, those items will
be retained which have an intermediate level of discrimination.
But there is no assumption in the Rasch model which says that
items shall have an intermediate level of discrimination; all
that is required is that the items shall be homogeneous with
respect to discrimination. Thus it may well be possible to
select a subset of highly discriminating items which fit the
model. If the scale is to be used to measure individual
differences, such a scale composed of highly discriminating
items will have better properties than a scale composed of
items with an intermediate discrimination, at least if the
discrimination is not so high, and the difficulties not so
uniform, that the attenuation paradox appears (Loevinger,
1954).

In passing, it can be noted that studies have been carried out 
(e.g. Tinsley & Davis, 1972) in which the Rasch model has been 
compared with other methods for item screening. In these studies 
the tests of item fit have been used to select items for the 
Rasch scales, and it has not been realized that items which 
appear to have too high a discrimination could have been selected 
just as well. 

Other examples where tests of the fit of single items may give 
absurd results are easily envisaged. If, for example, a set of
items is heterogeneous in the sense that two dimensions are 
covered, an ICCSL test may, but need not, indicate a poor fit. 
But if a process of item selection is carried out we will end 
up, at best, with a scale covering only one of the dimensions 
in the original set. What should be done in such a case is of 
course to sort the items into internally homogeneous subsets 
each of which will show a good fit to the model. 

Gustafsson and Lindblad (1978) presented an empirical example
of that situation. In analyses of a test of English grammar
for Swedish students it was found that a set of items measuring
knowledge of irregular verbs did not fit the model. But in a
separate analysis of these items it was found that they did fit
the model, as did the rest of the items, after some poorly
constructed items had been excluded. Had the items measuring
knowledge of irregular verbs been excluded, that would have
implied an undue narrowing of the scope of the test, but through
forming two scales instead of one, both kinds of items were
retained.

The Rasch model has been criticized by several authors (e.g.
Goldstein & Blinkhorn, 1977; Whitely, 1977; Wood, 1978) because
it has been thought that the strong assumptions of the model
make it necessary to exclude items not fitting the model. Wood
(1978), for example, said:



By narrowing the scope of the tests in order to fit
the Rasch model, we may run the risk of throwing out
the baby with the bath water, even though the
measurements have desirable, perhaps even necessary,
properties... (Wood, 1978, p. 31).

This criticism is warranted if it is assumed that only one scale
is to be used, but not otherwise; any degree of heterogeneity
can be represented with the Rasch model as long as several
different scales are constructed (cf. Lumsden, 1976, p. 267).

From the list of problems associated with the exclusion of
poor-fitting items to obtain fit, the skeleton of an alternative
strategy can be outlined. First of all the likely causes of the
poor fit should be identified. If among the likely sources of
deviation there are factors other than item heterogeneity, the
proper actions should be taken to remove those threats against
the model (i.e. remove speededness, guessing and so on). It
should then be investigated if the item heterogeneity is so
severe that the items should be grouped into homogeneous
subsets, or if a few poorly constructed items can be excluded
to obtain fit. In the next step, any suggested scale should be
cross-validated on another sample of persons with further items.

In order for such a strategy to be successful, a very good
knowledge of the sample, the testing situation and the content
of the items is necessary; otherwise it will be impossible to
trace the different sources of deviation and to group the items.
The goodness-of-fit tests are likely to contribute to the
evaluation of fit, but they can certainly not replace subject
matter knowledge.

Concluding remarks 

If anything, it should stand clear from the discussions in this
paper that it is difficult both to evaluate and to obtain fit
of data to the Rasch model. It can only be hoped that this does
not deter users from the Rasch model, because if used properly
there are sometimes large theoretical and practical gains to be
made, and especially so if the goodness-of-fit problem is given
due attention.

Closely associated with the Rasch model is the theory of specific
objectivity (Rasch, 1960, 1961, 1977) which says that it should
be possible to compare objects (persons) independently of agents
(items) and agents independently of objects. When data fit the
Rasch model specifically objective comparisons of items can be
made, as well as specifically objective comparisons of persons.
But users of the Rasch model must bear in mind the following
caution, made by Rasch himself:

In an empirical science specific objectivity can never 
be fully ascertained if the objects and/or agents is 
an infinite set; it can only be set up as a working 
hypothesis which has got to be carefully tested, e.g. 
by exposing an extensive body of objects to a wide 
range of agents and analyzing the reactions. And 
whenever additional data are collected we must be 
ready to do it over again — possibly having to 
revise previous optimistic conclusions. (Rasch, 
1965, p. 8, with some changes of notation).

FOOTNOTES

1) I want to thank Bengt Muthén for kindly giving me access
to these data, and also those persons acknowledged by
Muthén (1978) for having originally contributed the data.

2) These computations, and all others reported in this paper,
were made with a FORTRAN IV computer program (PML3), written
by the present author for use on IBM 360/370. PML3 computes
the CML estimates of the item parameters, and estimates of
the person parameters. The program also computes all the
goodness-of-fit tests presented here, except for tests of
person fit. A copy of the program written on tape may be
obtained at cost from Jan-Eric Gustafsson, Institute of
Education, University of Göteborg, Fack, S-431 20 MÖLNDAL,
Sweden.

3) I want to thank Philip Ramsey, now at the City University
of New York, for putting into my hands this excellent random
number generator.

Table 1

Percentage of successful replications in which the
A-ICCSL test rejected the Rasch model at the 5
percent level in the presence of heterogeneous
item discrimination.

SMALL AMOUNT OF DEVIATION

                          Test design
              Peaked                      Spaced
         Number of items             Number of items
          15          30              15          30
      Sample size  Sample size    Sample size  Sample size
SD     150   300    150   300      150   300    150   300

Low     11    15     15    44        6    15     11    21
High    26    51     44    80       13    33     22    60

LARGE AMOUNT OF DEVIATION

                          Test design
              Peaked                      Spaced
         Number of items             Number of items
          15          30              15          30
      Sample size  Sample size    Sample size  Sample size
SD     150   300    150   300      150   300    150   300

Low     53    89     89    99       28    67     57    64
High    99   100    100   100       81    98     97   100

Table 2

Percentage of successful replications in which the
A-ICCSL test rejected the Rasch model at the 5 percent
level in the presence of guessing.

              Number of items
           15               30
       Sample size      Sample size
SD     150    300       150    300

Low     12     16        15     32
High    26     48        59     96



Table 3

Percentage of successful replications in which
the A-ICCSL test rejected the Rasch model at
the 5 percent level in the presence of guessing
and varying item discrimination.

              Number of items
           15               30
       Sample size      Sample size
SD     150    300       150    300

Low     10     35        32     77
High    53     94        90    100



Table 4

Differences between estimates of means of
ability using parameters estimated within
the same group of persons and parameters
estimated within the other group of persons

                             TEST

                  ZCORR   PCORR   NCORR   Slinde & Linn (1979)

Easy items
  Low              -.02    -.45     .39     .62
  High              .14     .50    -.36    -.30

Difficult items
  Low              -.13     .40    -.44    -.48
  High              .02    -.44     .40     .62